Maya: A New Open-Source Multilingual Multimodal AI Model

New Horizons for Multilingual, Multimodal AI: Maya Takes the Stage
Development in Artificial Intelligence (AI) continues at a rapid pace. Vision-Language Models (VLMs), models that process both images and text, have made remarkable progress recently. A new entrant in this field is Maya, an open-source multilingual multimodal model that aims to close the gap in how less widely spoken languages and diverse cultural contexts are handled.
Maya's Origin and Focus
Maya was developed to address current limitations of VLMs, which are trained mainly on large datasets dominated by widely spoken languages. As a result, these models often struggle with the nuances of less common languages and cultural contexts, and the datasets themselves frequently contain toxic content that the models can reproduce.
Maya's contributions center on three points: first, a multilingual image-text pre-training dataset covering eight languages, built on the LLaVA dataset; second, a comprehensive toxicity analysis of the LLaVA dataset, from which a new, toxicity-filtered version in eight languages was derived; third, a multilingual image-text model trained on this data that supports these languages and thereby improves cultural and linguistic understanding in vision-language tasks.
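To make the filtering step concrete, the following sketch screens caption text with an off-the-shelf toxicity classifier. The checkpoint (unitary/toxic-bert) and the 0.5 threshold are illustrative assumptions, not necessarily the Maya team's exact pipeline.

```python
# Sketch: caption-level toxicity filtering with a public multi-label
# classifier. Checkpoint and threshold are illustrative assumptions,
# not the Maya authors' exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "unitary/toxic-bert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).eval()

def max_toxicity(caption: str) -> float:
    """Highest sigmoid score across the classifier's toxicity labels."""
    inputs = tokenizer(caption, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.sigmoid(logits).max().item()

captions = [
    "A group of friends sharing a meal at a street market.",
    "Two dogs playing fetch in a park.",
]
clean = [c for c in captions if max_toxicity(c) < 0.5]
print(f"kept {len(clean)} of {len(captions)} captions")
```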
Technical Details and Architecture
Maya builds on the LLaVA framework and uses the Aya-23 8B model as its language backbone. For image encoding it relies on SigLIP, chosen for its multilingual adaptability. The model supports eight languages – English, Chinese, French, Spanish, Russian, Japanese, Arabic, and Hindi – and was trained on a specially prepared dataset of 558,000 images with multilingual annotations. The context length is 8,000 tokens, and the model has 8 billion parameters.
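The LLaVA recipe that Maya follows connects these parts with a small projection module: image features from the vision encoder are mapped into the language model's embedding space and prepended to the text tokens. The sketch below illustrates that wiring in plain PyTorch; the dimensions and the two-layer MLP design (popularized by LLaVA-1.5) are assumptions for illustration, not Maya's actual code.

```python
# Illustrative LLaVA-style wiring: project frozen SigLIP patch features
# into the LLM embedding space, then concatenate with text embeddings.
# All dimensions are assumptions for illustration, not Maya's real config.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP projector in the style of LLaVA-1.5."""
    def __init__(self, vision_dim: int = 1152, text_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        return self.proj(image_features)

connector = VisionLanguageConnector()
image_features = torch.randn(1, 729, 1152)  # e.g. SigLIP patch embeddings
text_embeddings = torch.randn(1, 32, 4096)  # e.g. Aya-23 token embeddings
fused = torch.cat([connector(image_features), text_embeddings], dim=1)
print(fused.shape)  # torch.Size([1, 761, 4096])
```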
Applications and Potential
Maya is designed for a variety of applications, including multilingual visual question answering, cross-cultural image understanding, image description in multiple languages, visual reasoning tasks, and document understanding. The open-source nature of the model and the dataset allows the research community to build upon it and further develop the technology.
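As a concrete illustration of the multilingual visual question answering use case, the snippet below formats the same question in three of the supported languages using the <image> placeholder convention of the LLaVA family. The exact chat template the released Maya checkpoint expects may differ; this only shows the task format.

```python
# Hypothetical multilingual VQA prompts in the LLaVA placeholder style.
# The released checkpoint's actual chat template may differ.
prompts = {
    "en": "What festival is being celebrated in this image?",
    "hi": "इस छवि में कौन सा त्योहार मनाया जा रहा है?",
    "es": "¿Qué festival se está celebrando en esta imagen?",
}
for lang, question in prompts.items():
    print(f"[{lang}] USER: <image>\n{question}\nASSISTANT:")
```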
Challenges and Limitations
Despite the promising approach, Maya currently supports only eight languages. The model needs high-quality images for optimal performance, nuanced cultural contexts may not always be fully captured, and performance varies by language and task.
Bias, Risks, and Limitations
The developers of Maya have focused on mitigating bias and ensuring safety during the development of the model. The dataset was filtered for toxic content, and cultural sensitivity assessments were conducted. Despite these measures, users should be aware that the model may still exhibit biases present in the training data. Performance can vary in different cultural contexts. Maya is not suitable for critical decision-making.
Outlook
Maya represents an important step towards more inclusive and culturally sensitive AI models. The open availability of code, weights, and dataset offers the research community the opportunity to build on this foundation and further develop the technology. Future research could focus on expanding language support, improving performance in capturing nuanced cultural contexts, and further mitigating bias. Maya has the potential to fundamentally change the way we interact with AI and make the benefits of the technology accessible to a wider audience.