Neighboring Autoregressive Modeling Improves Efficiency in Visual Generation

More Efficient Visual Generation through Neighboring Autoregressive Modeling
The generation of images and videos with artificial intelligence has made enormous progress in recent years, and autoregressive models play a central role in it. These models generate content token by token (or pixel by pixel), with each new element conditioned on those already produced. A common choice is raster-scan prediction, in which tokens are processed in a fixed left-to-right, top-to-bottom order, much like reading a page. This ordering, however, ignores the spatial and temporal locality of visual content: neighboring pixels, tokens, or frames are typically far more strongly correlated than distant ones, as the short sketch below illustrates.
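To make the limitation concrete, here is a minimal, illustrative sketch (not taken from the paper) of how raster-scan ordering linearizes a small token grid; the grid size and helper name are assumptions chosen for the example.

```python
# Illustrative sketch: raster-scan ordering of a 4x4 token grid.
# Tokens are decoded strictly left-to-right, top-to-bottom, so two tokens that
# are vertical neighbors in 2D end up a full row apart in the 1D sequence.

def raster_order(height: int, width: int) -> list[tuple[int, int]]:
    """Return token coordinates in raster-scan (reading) order."""
    return [(r, c) for r in range(height) for c in range(width)]

order = raster_order(4, 4)
print(order[:6])   # [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]

# (0, 0) and (1, 0) are spatially adjacent, yet (1, 0) is decoded 4 steps later.
print(order.index((1, 0)) - order.index((0, 0)))   # 4
```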
A new approach called "Neighboring Autoregressive Modeling" (NAR) addresses this mismatch. NAR formulates autoregressive visual generation as a progressive outpainting process that follows a next-neighbor prediction mechanism: starting from an initial token, the remaining tokens are decoded in ascending order of their Manhattan distance to that initial token in spatio-temporal space. Put simply, the nearest neighbors of the starting point are predicted first, then their neighbors, and so on, so that the decoded region expands outward step by step.
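The following minimal sketch (illustrative, not the authors' code) shows this ordering for a small 2D token grid; the start position and helper name are assumptions made for the example.

```python
# Illustrative sketch: next-neighbor ordering of a 4x4 token grid.
# Starting from an initial token, the remaining tokens are grouped by their
# Manhattan distance to it; each group forms one decoding step, so the decoded
# region grows outward like a progressive outpainting process.

from collections import defaultdict

def neighbor_order(height: int, width: int, start=(0, 0)):
    """Group token coordinates by Manhattan distance to the start token."""
    groups = defaultdict(list)
    for r in range(height):
        for c in range(width):
            d = abs(r - start[0]) + abs(c - start[1])
            groups[d].append((r, c))
    return [groups[d] for d in sorted(groups)]

for step, tokens in enumerate(neighbor_order(4, 4)):
    print(f"step {step}: {tokens}")
# step 0: [(0, 0)]
# step 1: [(0, 1), (1, 0)]
# step 2: [(0, 2), (1, 1), (2, 0)]
# ...
```

For this 4x4 grid, the 16 tokens are covered in 7 steps instead of 16, provided all tokens at the same distance can be predicted in parallel; the next paragraph describes how NAR achieves exactly that.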
To predict several neighboring tokens in parallel, NAR uses dimension-oriented decoding heads, each of which is responsible for predicting the next token along one specific dimension (e.g. row, column, or time). During inference, all tokens adjacent to the already decoded region are then predicted in a single parallel step. This parallelism greatly reduces the number of required forward passes and therefore speeds up generation considerably.
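A hedged sketch of what such heads could look like is shown below; the class name, shapes, and two-dimensional setup are assumptions for illustration, not the released implementation.

```python
# Hedged sketch of dimension-oriented decoding heads (assumed names and shapes):
# a shared backbone hidden state is projected by one lightweight head per
# generation dimension, so the neighbor along the row axis and the neighbor
# along the column axis can be predicted in the same forward pass.

import torch
import torch.nn as nn

class DimensionHeads(nn.Module):
    def __init__(self, hidden_dim: int, vocab_size: int, num_dims: int = 2):
        super().__init__()
        # One output head per dimension (row and column here; video would add time).
        self.heads = nn.ModuleList(
            nn.Linear(hidden_dim, vocab_size) for _ in range(num_dims)
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: (batch, frontier_tokens, hidden_dim) for the tokens on the
        # boundary of the decoded region; each head predicts the next token
        # along its own dimension for all of them at once.
        return [head(hidden) for head in self.heads]

heads = DimensionHeads(hidden_dim=256, vocab_size=1024, num_dims=2)
hidden = torch.randn(1, 3, 256)           # e.g. 3 frontier tokens from the backbone
row_logits, col_logits = heads(hidden)    # two (1, 3, 1024) logit tensors
```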
Experimental results on image and video benchmarks such as ImageNet 256×256 and UCF-101 demonstrate NAR's efficiency. Compared to PAR-4X, another parallelized autoregressive approach, NAR achieves 2.4 times higher throughput for image generation and 8.6 times higher throughput for video generation, while at the same time obtaining better FID/FVD scores, which measure the quality of the generated content. NAR also shows promising results in text-to-image generation: a NAR model with 0.8 billion parameters surpasses the Chameleon-7B model despite being trained on only 40% of the training data.
The development of NAR represents an important step towards more efficient and powerful autoregressive models for visual generation. The ability to leverage the spatial and temporal locality of visual data enables faster generation while maintaining or even improving quality. Research in this area will further advance the possibilities of AI-powered content creation and open up new applications in various fields.
This combination of higher speed and high output quality makes NAR a promising foundation for future work in visual generation: faster synthesis of complex images and videos opens up new possibilities for creative applications as well as for the automated production of visual content across industries.
Bibliography:
- He, Y., He, Y., He, S., Chen, F., Zhou, H., Zhang, K., & Zhuang, B. (2025). Neighboring Autoregressive Modeling for Efficient Visual Generation. arXiv preprint arXiv:2503.10696. https://huggingface.co/papers/2503.10696
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2021). High-Resolution Image Synthesis with Latent Diffusion Models. arXiv preprint arXiv:2112.10752.
- Chang, H., Zhang, H., Xu, J., Wang, S., & Ma, C. (2024). Parallel Auto-Regressive Modeling with a Near-to-Far Order for Visual Generation. Advances in Neural Information Processing Systems, 37.
- Epiphany Team. Parallel Auto-Regressive project. https://epiphqny.github.io/PAR-project/
- Chen, F., Zhou, H., He, Y., Zhang, K., Zhuang, B., & He, Y. (2024). Controllable Autoregressive Modeling for Visual Generation. arXiv preprint arXiv:2412.04062.
- He, Y., He, Y., Chen, F., Zhou, H., Zhang, K., & Zhuang, B. (2024). A Survey on Vision Autoregressive Model. arXiv preprint arXiv:2412.15119.
- FoundationVision. VAR. https://github.com/FoundationVision/VAR