Perception Tokens Improve Visual Reasoning in Multimodal Language Models

Multimodal language models (MLMs) still struggle with basic visual perception tasks at which specialized models excel. Tasks requiring reasoning about 3D structures benefit from depth estimation, and reasoning about 2D object instances benefits from object detection. Yet MLMs cannot produce intermediate depth maps or bounding boxes to reason over. Fine-tuning MLMs on relevant data generalizes poorly, and outsourcing computation to specialized vision tools is computationally and memory intensive.
To address this, the authors introduce perception tokens: intrinsic image representations designed to assist reasoning in tasks where language alone is insufficient. Perception tokens act as auxiliary reasoning tokens, analogous to chain-of-thought prompts in language models. On a depth-related task, for instance, a perception-token-augmented MLM can reason by generating a depth map as tokens and then drawing on those tokens to answer the question.
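To make the idea concrete, the sketch below shows what such a reasoning trace might look like. The token names (<DEPTH_START>, <DEPTH_*>) and the formatting are hypothetical stand-ins for illustration; the paper's actual vocabulary may differ.

```python
# Minimal sketch of a perception-token reasoning trace. Token names and
# formatting are hypothetical; AURORA's real vocabulary may differ.
def build_reasoning_trace(question: str, depth_token_ids: list[int]) -> str:
    """Interleave text with perception tokens, chain-of-thought style."""
    depth_tokens = " ".join(f"<DEPTH_{i}>" for i in depth_token_ids)
    return (
        f"User: {question}\n"
        "Assistant: Let me first estimate the scene depth.\n"
        f"<DEPTH_START> {depth_tokens} <DEPTH_END>\n"
        "Reading the depth map, the chair is closer than the table.\n"
        "Answer: the chair."
    )

print(build_reasoning_trace(
    "Which object is closer, the chair or the table?", [17, 4, 253, 98]
))
```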
AURORA, the accompanying training method, augments MLMs with perception tokens for improved reasoning over visual inputs. It leverages a VQVAE to transform intermediate image representations such as depth maps into a tokenized format, alongside bounding box tokens, which are then used in a multi-task training framework.
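The core of this tokenization is standard VQVAE quantization: the encoder's continuous latents are snapped to the nearest entry of a learned codebook, yielding a grid of discrete token ids the language model can emit. The minimal NumPy sketch below illustrates that lookup with a random stand-in codebook; the shapes and codebook size are assumptions, not AURORA's actual configuration.

```python
import numpy as np

# Toy VQVAE quantization step: map encoded depth latents to discrete token
# ids via nearest-neighbor lookup in a codebook. All shapes are stand-ins.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))          # 256 codes, 8-dim latents

def quantize(latents: np.ndarray) -> np.ndarray:
    """latents: (H, W, 8) encoder output -> (H, W) grid of token ids."""
    flat = latents.reshape(-1, latents.shape[-1])                  # (H*W, 8)
    dists = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1).reshape(latents.shape[:-1])       # nearest code

depth_latents = rng.normal(size=(16, 16, 8))  # stand-in for an encoded depth map
token_ids = quantize(depth_latents)
print(token_ids.shape, token_ids[:2, :4])     # (16, 16) grid of discrete ids
```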
AURORA achieves notable gains on counting benchmarks: +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench, and it generalizes across datasets better than fine-tuning approaches. It also improves relative depth estimation on BLINK by over 6%. With perception tokens, AURORA extends MLMs beyond language-based reasoning, paving the way for more effective visual reasoning capabilities.
The Importance of Perception Tokens
Perception tokens enable MLMs to go beyond mere language processing and develop a deeper understanding of visual scenes. Instead of relying solely on textual descriptions, models that integrate depth information and object locations can handle more complex tasks. This is particularly relevant for applications requiring spatial understanding, such as robot navigation or the interpretation of medical images.
AURORA: An Innovative Training Approach
The development of AURORA represents a significant step in the advancement of MLMs. The VQVAE tokenizer and the multi-task training framework make the integration of perception tokens efficient and effective. A curriculum learning approach ensures that the model picks up the new tokens progressively without forgetting previously acquired knowledge, which is crucial for the stability and robustness of training.
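As a rough illustration of such a curriculum, the sketch below ramps up the share of perception-token examples per batch while retaining the original instruction data, one common way to mitigate forgetting. The 50/50 target mix and the ramp schedule are assumptions for illustration, not AURORA's published recipe.

```python
import random

# Hedged sketch of a curriculum-style data mixer: start mostly with the
# model's original instruction data and gradually increase the share of
# perception-token tasks. Schedule and mix ratio are illustrative only.
def mixed_batch(step, total_steps, instruct_data, perception_data, batch_size=4):
    # Ramp the perception-task fraction from 0 to 0.5 over the first half.
    frac = min(1.0, step / (0.5 * total_steps)) * 0.5
    return [
        random.choice(perception_data if random.random() < frac else instruct_data)
        for _ in range(batch_size)
    ]

instruct = ["describe the image", "caption this photo"]
perception = ["count objects via <DEPTH_*> tokens", "locate boxes via <BOX_*> tokens"]
for step in (0, 500, 1000):
    print(step, mixed_batch(step, 1000, instruct, perception))
```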
Results and Outlook
The results on these benchmarks demonstrate the potential of perception tokens and AURORA. The marked improvements in relative depth estimation and object counting show that integrating visual perception information makes MLMs substantially more capable. This development opens up new possibilities for applying MLMs in areas such as robotics, medical imaging, and virtual reality.
Research on perception tokens and AURORA is still in its early stages, but the results so far are promising. Future research could focus on expanding the scope of perception tokens to other visual tasks, as well as developing even more efficient training methods. The integration of perception tokens could be a key to developing truly intelligent multimodal systems.
Bibliography

Bigverdi, M., Luo, Z., Hsieh, C.-Y., Shen, E., Chen, D., Shapiro, L. G., & Krishna, R. (2024). Perception Tokens Enhance Visual Reasoning in Multimodal Language Models. arXiv preprint arXiv:2412.03548.

Ren, B., Li, X., Weber, C., Hafez, A., & Wermter, S. (2023). Multimodal Large Language Models for Robot Manipulation: A Pilot Study on Visual Prompting. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 1-8). IEEE.