VITA-1.5: A Real-Time Multimodal Language Model

Multimodal Language Models: VITA-1.5 Enables Real-Time Interaction with Image and Speech

Multimodal large language models (MLLMs) have recently made significant progress, particularly in integrating visual and textual modalities. The inclusion of speech in these models, although crucial for natural human-computer interaction, presents a challenge due to the fundamental differences between the modalities. This article highlights VITA-1.5, an MLLM that combines image, text, and speech processing in real time.

The Challenge of Multimodal Integration

While MLLMs process images and text effectively, integrating speech is more complex. Visual data encodes spatial structure, whereas speech is a temporal sequence. These differences cause conflicts during training, as optimizing for one modality can degrade performance in another. Traditional speech dialogue systems chain separate modules for Automatic Speech Recognition (ASR) and Text-to-Speech (TTS), which adds latency and reduces coherence between what is understood and what is spoken back.
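As a rough illustration of where the latency in a cascaded system comes from, the sketch below simulates the three hand-offs with placeholder functions. The names run_asr, run_llm, and run_tts are hypothetical stand-ins, and the sleep times are arbitrary; this is not the pipeline of any particular system.

```python
import time

# Hypothetical stand-ins for the stages of a cascaded spoken-dialogue pipeline;
# the sleep() calls only simulate per-stage latency for illustration.
def run_asr(audio: bytes) -> str:
    time.sleep(0.8)                     # simulated ASR latency
    return "what is in this picture?"

def run_llm(prompt: str) -> str:
    time.sleep(1.5)                     # simulated LLM latency
    return "A cat sitting on a windowsill."

def run_tts(text: str) -> bytes:
    time.sleep(0.9)                     # simulated TTS latency
    return b"<synthesized audio>"

def cascaded_reply(audio: bytes) -> bytes:
    """Classic ASR -> LLM -> TTS pipeline: every stage must finish before
    the next starts, so the stage latencies add up before the user hears
    anything."""
    start = time.time()
    transcript = run_asr(audio)         # speech -> text
    answer = run_llm(transcript)        # text -> text
    speech = run_tts(answer)            # text -> speech
    print(f"cascaded latency: {time.time() - start:.1f}s")
    return speech

if __name__ == "__main__":
    cascaded_reply(b"<user speech>")    # ~3.2s in this toy simulation
```

An end-to-end model like VITA-1.5 removes the two hand-offs, replacing them with a single pass from speech input to speech output, which is what makes the latency reduction described below possible.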

VITA-1.5: A Three-Stage Training Approach

VITA-1.5 addresses these challenges with a three-stage training approach. The first stage focuses on image-text processing: visual adapters are trained, and the model is fine-tuned on descriptive image captions and visual question-answering data. The second stage integrates audio input: an audio encoder is trained on speech transcription data and then refined with speech question-answering data. The third stage trains an audio decoder, enabling end-to-end speech generation and removing the need for an external TTS module. This progressive schedule is designed to mitigate conflicts between modalities, so that adding speech does not erode the model's vision-language performance.
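A minimal sketch of how such a staged curriculum can be expressed in PyTorch is shown below, freezing and unfreezing placeholder modules per stage. The module names, sizes, and the exact freezing choices are illustrative assumptions, not the paper's precise recipe.

```python
import torch.nn as nn

# Tiny placeholder modules standing in for the real VITA-1.5 components;
# the actual encoders, adapters, and LLM backbone are far larger.
class ToyMultimodalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(32, 16)
        self.vision_adapter = nn.Linear(16, 8)
        self.audio_encoder  = nn.Linear(32, 16)
        self.audio_adapter  = nn.Linear(16, 8)
        self.llm            = nn.Linear(8, 8)
        self.speech_decoder = nn.Linear(8, 4)

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model: ToyMultimodalModel, stage: int) -> None:
    """Roughly mirrors the staged curriculum: stage 1 trains the vision path,
    stage 2 adds audio input, stage 3 trains the speech decoder. Which parts
    stay frozen in each stage is an assumption for illustration."""
    set_trainable(model, False)                    # freeze everything first
    if stage == 1:                                 # vision-language training
        set_trainable(model.vision_adapter, True)
        set_trainable(model.llm, True)
    elif stage == 2:                               # audio input tuning
        set_trainable(model.audio_encoder, True)
        set_trainable(model.audio_adapter, True)
    elif stage == 3:                               # audio output tuning
        set_trainable(model.speech_decoder, True)

model = ToyMultimodalModel()
for stage in (1, 2, 3):
    configure_stage(model, stage)
    active = sorted({n.split(".")[0] for n, p in model.named_parameters()
                     if p.requires_grad})
    print(f"stage {stage}: training {active}")
```

Freezing earlier components while later ones are trained is a common way to keep newly added modalities from overwriting what the model already learned.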

Architecture of VITA-1.5

VITA-1.5 uses a "Multimodal Encoder-Adapter-LLM" architecture. On the input side, image and audio encoders are connected to the LLM through adapters; on the output side, a dedicated end-to-end speech module generates audio directly. InternViT-300M serves as the image encoder and applies a dynamic patching strategy to high-resolution images, while videos are processed as sequences of frames. The audio encoder combines downsampling convolutional layers with transformer blocks, and TiCodec serves as the codec model that maps between speech waveforms and discrete speech tokens.
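The input path can be pictured as a small amount of glue code: each encoder's features pass through an adapter that projects them into the LLM's embedding space, and the resulting token sequences are concatenated with the text embeddings. The sketch below uses toy dimensions and simple linear adapters; it is an assumed illustration of the pattern, not the actual VITA-1.5 code.

```python
import torch
import torch.nn as nn

# Toy dimensions; in the real model the vision features come from
# InternViT-300M and the audio features from a conv+transformer encoder.
VISION_DIM, AUDIO_DIM, LLM_DIM = 64, 48, 128

class MultimodalFrontend(nn.Module):
    """Minimal sketch of the Encoder-Adapter-LLM input path: each modality is
    projected into the LLM embedding space and concatenated with text tokens."""
    def __init__(self):
        super().__init__()
        self.vision_adapter = nn.Linear(VISION_DIM, LLM_DIM)  # an MLP in practice
        self.audio_adapter  = nn.Linear(AUDIO_DIM, LLM_DIM)

    def forward(self, image_feats, audio_feats, text_embeds):
        img_tokens = self.vision_adapter(image_feats)   # (n_img, LLM_DIM)
        aud_tokens = self.audio_adapter(audio_feats)    # (n_aud, LLM_DIM)
        # One flat sequence for the LLM; real systems also insert separator tokens.
        return torch.cat([img_tokens, aud_tokens, text_embeds], dim=0)

frontend = MultimodalFrontend()
sequence = frontend(
    torch.randn(16, VISION_DIM),   # pretend patch features from the image encoder
    torch.randn(25, AUDIO_DIM),    # pretend frames from the audio encoder
    torch.randn(10, LLM_DIM),      # pretend text token embeddings
)
print(sequence.shape)              # torch.Size([51, 128])
```

Keeping the encoders modality-specific and doing the fusion inside the LLM is what lets the same backbone attend jointly over image, audio, and text tokens.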

Evaluation and Results

VITA-1.5 was evaluated on a range of image, video, and speech understanding benchmarks and compared with open-source and proprietary models. It performs on par with leading image-focused MLLMs on image and video understanding while showing significant improvements in speech processing. According to the authors, end-to-end speech interaction latency drops from roughly 4 seconds to about 1.5 seconds, enabling near real-time conversation.

Outlook

VITA-1.5 represents a significant step toward seamless multimodal interaction. Combining image, text, and speech in real time opens up new possibilities for human-computer interfaces and dialogue systems. The ongoing development of MLLMs like VITA-1.5 promises a future in which interaction with technology becomes increasingly intuitive and natural.

Bibliography:
- https://arxiv.org/abs/2501.01957
- https://huggingface.co/papers/2501.01957
- https://arxiv.org/html/2501.01957v1
- https://github.com/VITA-MLLM/VITA
- https://deeplearn.org/arxiv/564375/vita-1.5:-towards-gpt-4o-level-real-time-vision-and-speech-interaction
- https://www.alphaxiv.org/abs/2501.01957
- https://x.com/javaeeeee1/status/1876230205783802331
- https://x.com/_akhaliq/status/1876121786422890618
- https://www.chatpaper.com/chatpaper/zh-CN?id=4&date=1736092800&page=1
- https://huggingface.co/collections/Giuliano/voice-677b7af08b252e571a81f4ab