Predicting the future requires an understanding of how the world works. A model with this capability has many appealing applications, from robotic planning to representation learning. Predicting raw future observations, such as frames in a video, is challenging and inherently ambiguous: a naively designed model will average the possible futures together into a single, blurry prediction.
Recently, this issue has been tackled by two distinct approaches: (a) latent variable models trained with variational inference, which explicitly model the underlying stochasticity, and (b) adversarially trained models that aim to produce naturalistic images. However, a standard latent variable model can struggle to produce realistic results, and a standard adversarially trained model underutilizes its latent variables and fails to produce diverse predictions. We show that these distinct methods are in fact complementary: combining the two yields predictions that look more realistic to human raters and better cover the range of possible futures, outperforming prior work.
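To make the combination concrete, here is a minimal sketch of a generator objective that mixes the two training signals: an L1 reconstruction term and a KL penalty (the variational part) plus a non-saturating adversarial term (the GAN part). The function name, the weighting coefficients, and the use of plain NumPy are illustrative assumptions, not the exact formulation or hyperparameters used by SAVP.

```python
import numpy as np

def combined_vae_gan_loss(x_real, x_pred, mu, logvar, d_fake,
                          lambda_kl=0.1, lambda_gan=1.0):
    """Illustrative sketch of a mixed VAE + GAN generator objective.

    x_real, x_pred : ground-truth and predicted frames (NumPy arrays)
    mu, logvar     : parameters of the approximate posterior q(z | x)
    d_fake         : discriminator outputs in (0, 1) for predicted frames
    lambda_kl/gan  : assumed weights, not the paper's actual values
    """
    # L1 reconstruction term (the VAE likelihood surrogate)
    recon = np.mean(np.abs(x_real - x_pred))
    # KL divergence between q(z|x) = N(mu, exp(logvar)) and N(0, I)
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    # Non-saturating GAN loss for the generator
    gan = -np.mean(np.log(d_fake + 1e-8))
    return recon + lambda_kl * kl + lambda_gan * gan
```

The intuition is that the reconstruction and KL terms keep the latent variables informative (preserving diversity), while the adversarial term pushes individual samples toward realistic-looking frames instead of blurry averages.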
Try the Stochastic Adversarial Video Prediction (SAVP) Model
We show qualitative results of the video predictions produced by our SAVP method, its GAN-only and VAE-only variants, and prior approaches: SV2P (Babaeizadeh et al., 2017) and SVG (Denton & Fergus, 2018). For the stochastic models, unless otherwise labeled, we show the prediction that is most similar to the ground-truth video (the "best" of 100 samples). Yellow indicates predicted frames, and white indicates the context frames the model is conditioned on.
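The best-of-N selection described above can be sketched as follows. Mean squared error stands in for the similarity metric here as an assumption; the actual evaluation could equally use SSIM or a learned perceptual distance.

```python
import numpy as np

def best_of_n(ground_truth, samples):
    """Pick the predicted video most similar to the ground truth.

    ground_truth : array of shape (T, H, W, C)
    samples      : list of N arrays of the same shape, one per draw
    Returns the index of the best sample and the sample itself.
    Similarity is measured with (negative) MSE as an illustrative choice.
    """
    errors = [np.mean((ground_truth - s) ** 2) for s in samples]
    best = int(np.argmin(errors))
    return best, samples[best]
```

In the qualitative comparisons, this procedure would be run once per stochastic model with N = 100 sampled futures, and only the selected sample is displayed.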