Overcoming Language Barriers: Zero-Shot Audio-Visual Speech Recognition with Large Language Models

Automatic speech recognition (ASR) has made enormous strides in recent years, but building systems that are both robust and truly multilingual remains a challenge. A promising way to overcome this hurdle is to combine audio-visual information with the capabilities of large language models (LLMs). A new research paper introduces a framework called "Zero-AVSR" that enables zero-shot audio-visual speech recognition: speech in a target language can be recognized without any audio-visual speech data in that language ever being seen during training.

The Core of Zero-AVSR: The Audio-Visual Speech Romanizer

The central component of the Zero-AVSR framework is the "Audio-Visual Speech Romanizer" (AV-Romanizer). It learns language-agnostic speech representations by predicting transcriptions in romanized form. Romanization, the transliteration of text into the Latin alphabet, serves as a universal intermediate representation: by predicting romanized text, the AV-Romanizer captures phonetic and linguistic information that carries across language boundaries.
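To make the romanization target concrete, the sketch below shows what a minimal AV-Romanizer-style model could look like in PyTorch: fused audio and lip-video features are encoded and decoded into Latin characters, here with a CTC-style head. All names, dimensions, and the character inventory are illustrative assumptions; the paper's actual encoder and output vocabulary may differ.

```python
# Minimal sketch of an AV-Romanizer-style model (names and sizes are
# illustrative assumptions, not the paper's actual configuration).
import torch
import torch.nn as nn

ROMAN_VOCAB = list(" 'abcdefghijklmnopqrstuvwxyz")  # Latin characters; CTC adds a blank

class AVRomanizerSketch(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.video_proj = nn.Linear(video_dim, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=4,
        )
        self.head = nn.Linear(hidden, len(ROMAN_VOCAB) + 1)  # +1 for the CTC blank

    def forward(self, audio_feats, video_feats):
        # Fuse time-aligned audio and lip-video features by addition.
        x = self.audio_proj(audio_feats) + self.video_proj(video_feats)
        x = self.encoder(x)
        return self.head(x)  # per-frame logits over roman characters

model = AVRomanizerSketch()
logits = model(torch.randn(1, 100, 80), torch.randn(1, 100, 512))
print(logits.shape)  # torch.Size([1, 100, 29])
```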

From Romanization to the Target Language: The Cascaded Architecture

The strength of Zero-AVSR lies in combining the AV-Romanizer with the multilingual capabilities of LLMs. In the cascaded architecture, the romanized text predicted by the AV-Romanizer is converted by an LLM into the language-specific script of the target language. This two-stage process enables speech recognition in languages for which no audio-visual training data exists.
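Conceptually, the second stage is a de-romanization prompt to an instruction-tuned multilingual LLM. The following sketch illustrates this with the Hugging Face transformers text-generation pipeline; the model name, prompt wording, and the Korean example are assumptions for illustration, not the exact setup from the paper.

```python
from transformers import pipeline

# Any instruction-tuned multilingual LLM can play this role; the model
# name here is an illustrative assumption, not the paper's choice.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-7B-Instruct")

romanized = "annyeonghaseyo mannaseo bangapseumnida"  # hypothetical AV-Romanizer output
prompt = (
    "Convert the following romanized Korean speech transcript into Korean script. "
    f"Output only the converted text.\n\n{romanized}\n"
)

result = generator(prompt, max_new_tokens=64, do_sample=False)
print(result[0]["generated_text"])
```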

Unified Architecture: Direct Integration into the LLM

In addition to the cascaded variant, the paper also examines a unified architecture, in which the audio-visual speech representations encoded by the AV-Romanizer are fed directly into the LLM. This is achieved by fine-tuning an adapter together with the LLM under a multi-task learning scheme, promising an even tighter coupling of audio-visual information and the language model.
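A minimal sketch of such an adapter, assuming a simple stack-and-project design: consecutive AV-Romanizer frames are stacked to shorten the sequence and then projected to the LLM's embedding width. The dimensions and the stacking stride are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AVAdapterSketch(nn.Module):
    """Projects AV-Romanizer features into the LLM embedding space.
    Dimensions and stride are illustrative assumptions."""
    def __init__(self, av_dim=256, llm_dim=4096, stride=4):
        super().__init__()
        self.stride = stride
        # Stack `stride` consecutive frames, then project to the LLM width.
        self.proj = nn.Linear(av_dim * stride, llm_dim)

    def forward(self, av_feats):  # (batch, frames, av_dim)
        b, t, d = av_feats.shape
        t = (t // self.stride) * self.stride  # drop trailing frames
        x = av_feats[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(x)  # (batch, frames // stride, llm_dim)

adapter = AVAdapterSketch()
speech_embeds = adapter(torch.randn(1, 100, 256))
# These embeddings are concatenated with the embedded text prompt and fed
# to the LLM, which is fine-tuned alongside the adapter on multiple tasks.
print(speech_embeds.shape)  # torch.Size([1, 25, 4096])
```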

MARC: A Multilingual Dataset for Training

To capture the diversity of phonetic and linguistic features across languages, the authors created a new dataset called the "Multilingual Audio-Visual Romanized Corpus" (MARC). MARC comprises 2,916 hours of audio-visual speech data in 82 languages, each utterance transcribed both in its language-specific script and in romanized form. This corpus is the training basis for the AV-Romanizer and enables the development of robust, cross-lingual models.
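One plausible way to represent a single MARC entry is a record that pairs the media files with both transcriptions. The field names and paths below are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class MarcExample:
    """One plausible record layout for a MARC-style corpus entry.
    Field names are illustrative assumptions, not the released schema."""
    video_path: str   # talking-face video clip
    audio_path: str   # corresponding audio track
    language: str     # ISO code, one of the 82 covered languages
    transcript: str   # transcription in the language-specific script
    romanized: str    # the same transcription in the Latin alphabet

example = MarcExample(
    video_path="clips/ko/000123.mp4",
    audio_path="clips/ko/000123.wav",
    language="kor",
    transcript="안녕하세요",
    romanized="annyeonghaseyo",
)
```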

Potential for the Future of Speech Recognition

The experimental results show that Zero-AVSR can recognize speech in languages the AV-Romanizer never saw during training. This approach could pave the way toward truly universal speech recognition systems and ease communication across language barriers.

Bibliography:

- https://arxiv.org/abs/2503.06273
- https://arxiv.org/html/2503.06273v1
- https://github.com/JeongHun0716
- http://paperreading.club/page?id=290518
- https://huggingface.co/papers?q=visual-audio
- https://x.com/arxivsound?lang=de
- https://papers.cool/arxiv/cs.MM
- https://www.researchgate.net/publication/373317546_AVFormer_Injecting_Vision_into_Frozen_Speech_Models_for_Zero-Shot_AV-ASR
- https://github.com/halsay/ASR-arxiv-daily
- https://www.researchgate.net/publication/363646824_AVATAR_Unconstrained_Audiovisual_Speech_Recognition