NOVA: Non-Quantized Autoregressive Video Generation

Autoregressive Video Generation: A New Approach Without Vector Quantization

The generation of videos using artificial intelligence has made significant progress in recent years. One promising approach is autoregressive video generation, in which video frames are produced sequentially, one after another. Previous autoregressive models typically rely on vector quantization to map video data into a discrete token space. However, this can cause efficiency problems, especially when modeling longer videos.

A new research paper presents an approach that enables autoregressive video generation without vector quantization. It reformulates video generation as non-quantized autoregressive modeling that integrates temporal frame-by-frame prediction with spatial set-by-set prediction. In contrast to the token-by-token raster-scan prediction of earlier autoregressive models, and to the joint modeling of fixed-length token sequences in diffusion models, this formulation retains the causal property of GPT-like models for flexible in-context capabilities while exploiting bidirectional modeling within individual frames for greater efficiency.
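
Conceptually, this corresponds to a block-causal attention pattern: tokens attend bidirectionally to all tokens within their own frame, but only causally to tokens of earlier frames. The following is a minimal sketch of such a mask in PyTorch; the function name and shapes are illustrative assumptions, not the paper's actual code.

```python
import torch

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask: True = attention allowed.

    Tokens attend bidirectionally within their own frame (block diagonal)
    and causally to all tokens of earlier frames (lower block triangle).
    """
    n = num_frames * tokens_per_frame
    # Frame index of each token position.
    frame_id = torch.arange(n) // tokens_per_frame
    # A query in frame i may attend to keys in frames j <= i.
    mask = frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)
    return mask  # shape (n, n)

# Example: 3 frames of 4 tokens each.
print(block_causal_mask(3, 4).int())
```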

Using this approach, the authors trained a new video autoregressive model without vector quantization, called NOVA (NOn-quantized Video Autoregressive model). The results show that NOVA surpasses previous autoregressive video models in data efficiency, inference speed, visual quality, and video fluency, despite a comparatively small capacity of only 0.6 billion parameters.

How NOVA Works

NOVA is based on a two-stage process. In the first stage, temporal modeling, video frames are predicted sequentially, each conditioned on the frames before it. The second stage, spatial modeling, predicts sets of image patches within each individual frame. This split enables more efficient modeling: bidirectional attention is used within a frame, while the causal structure is preserved across frames for temporal prediction, as the sketch below illustrates.
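
A high-level sketch of this two-level generation loop might look as follows. All names (temporal_model, spatial_model) and shapes are hypothetical placeholders chosen for illustration; they do not reflect NOVA's actual API.

```python
import torch

def generate_video(temporal_model, spatial_model, text_emb,
                   num_frames=16, tokens_per_frame=256, num_sets=8):
    """Sketch: causal prediction over frames, set-by-set prediction
    within each frame. temporal_model and spatial_model are assumed
    callables standing in for the two modeling stages."""
    frames = []  # continuous token maps of frames generated so far
    for t in range(num_frames):
        # Temporal stage: causal context from the text prompt and
        # all previously generated frames.
        context = temporal_model(text_emb, frames)  # (tokens_per_frame, dim)

        # Spatial stage: fill in the frame set by set, attending
        # bidirectionally to the tokens decoded so far.
        tokens = torch.zeros_like(context)
        known = torch.zeros(tokens_per_frame, dtype=torch.bool)
        order = torch.randperm(tokens_per_frame)  # random set order
        set_size = tokens_per_frame // num_sets
        for s in range(num_sets):
            idx = order[s * set_size:(s + 1) * set_size]
            # Predict the next set conditioned on all known tokens.
            tokens[idx] = spatial_model(tokens, known, context)[idx]
            known[idx] = True
        frames.append(tokens)
    return torch.stack(frames)  # (num_frames, tokens_per_frame, dim)
```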

A key aspect of NOVA is that it avoids vector quantization. Instead of converting video data into discrete tokens, NOVA works directly in a continuous latent space. This eliminates the need for a discrete tokenizer and reduces the information loss that quantization can introduce, which improves the quality of the generated videos and increases inference speed.
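
Concretely, a quantized model would train each token with a cross-entropy loss over a discrete codebook. In the non-quantized setting, the transformer instead outputs a continuous vector that conditions a small per-token denoising head, as proposed in related non-quantized autoregressive work. The following is a minimal, simplified sketch of such a loss; the module, the noise schedule, and all dimensions are assumptions for illustration, not NOVA's implementation.

```python
import torch
import torch.nn as nn

class DiffusionTokenLoss(nn.Module):
    """Per-token diffusion loss for continuous tokens.

    Instead of a softmax over a codebook, a small MLP is trained to
    denoise the ground-truth token x conditioned on the backbone
    output z. Hypothetical sketch, not NOVA's actual code.
    """
    def __init__(self, token_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # Sample a diffusion time step and noise the continuous token.
        t = torch.rand(x.shape[0], 1)             # timestep in [0, 1)
        eps = torch.randn_like(x)                 # Gaussian noise
        x_t = torch.sqrt(1 - t) * x + torch.sqrt(t) * eps  # toy schedule
        # Predict the noise from the noised token, condition, and timestep.
        eps_pred = self.net(torch.cat([x_t, z, t], dim=-1))
        return ((eps_pred - eps) ** 2).mean()

# Usage: x are continuous (e.g. VAE-encoded) tokens, z transformer outputs.
loss_fn = DiffusionTokenLoss(token_dim=16, cond_dim=768)
x, z = torch.randn(32, 16), torch.randn(32, 768)
print(loss_fn(x, z))
```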

Results and Potential

The evaluation of NOVA shows impressive results. On text-to-image generation tasks, NOVA even surpasses state-of-the-art image diffusion models at significantly lower training cost. On text-to-video tasks, it achieves compelling results in terms of visual quality and video fluency. A further advantage is NOVA's capacity for video extrapolation, i.e., generating videos longer than those seen during training.

NOVA generalizes well across different video lengths and enables various zero-shot applications within a single, unified model; that is, it can be applied to tasks for which it was not explicitly trained. This flexibility and efficiency make NOVA a promising approach for the next generation of video generation and for the development of world models. The release of the code and models allows the results to be reproduced and supports further research in this area.

Outlook

The development of NOVA represents a significant advance in autoregressive video generation. Avoiding vector quantization enables more efficient, higher-quality video generation. Future research could focus on scaling the model to higher resolutions and longer videos, as well as extending it to areas such as video editing and controllable generation.
