Microsoft Introduces Florence-VL: A New Generation of Multimodal AI Models

Multimodal large language models (MLLMs) are an emerging class of AI models that combine different data types, such as text and images, to improve how computers understand and interact with the world. Microsoft Research has introduced Florence-VL, a new family of such MLLMs built on the generative vision foundation model Florence-2.
Unlike traditional vision transformers trained with contrastive learning, such as CLIP, Florence-2 takes a generative approach, which allows it to capture visual features at different levels and from different perspectives. Instead of simply embedding an image into a vector space that encodes text-image similarity, Florence-2 can analyze and interpret images in detail, which makes it useful for a wide range of downstream tasks, from image captioning and object recognition to more complex tasks such as answering questions about an image.
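To make this concrete, the sketch below shows prompt-driven inference with the publicly released Florence-2 weights via Hugging Face transformers: switching the task tag switches the kind of output the same model generates. The image URL is a placeholder, and the exact task tags and post-processing behavior follow the public model card rather than the Florence-VL paper itself.

```python
# Minimal sketch: one Florence-2 model, several task prompts, several "views" of an image.
# Model ID and task tags are taken from the public Hugging Face model card; the image URL
# below is only a placeholder.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open(requests.get("https://example.com/street_scene.jpg", stream=True).raw)

# Each task prompt elicits a different aspect of the same image: captions at several
# levels of detail, object detection boxes, or the text visible in the image (OCR).
for task in ["<CAPTION>", "<DETAILED_CAPTION>", "<OD>", "<OCR>"]:
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=512,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation parses the raw output into a task-specific structure
    # (a string for captions, boxes and labels for detection, etc.).
    parsed = processor.post_process_generation(raw, task=task, image_size=image.size)
    print(task, parsed)
```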
Florence-2's visual features are integrated into pre-trained LLMs such as Phi-3.5 and Llama 3 through a novel feature fusion architecture that Microsoft calls "Depth-Breadth Fusion" (DBFusion). DBFusion combines visual features extracted at different depths of the vision encoder with features generated under multiple task prompts, yielding a more comprehensive understanding of the image. Combining information from different layers of the network with several prompt-specific views of the image content improves the model's robustness and accuracy.
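Setting the paper's exact implementation aside, the depth-breadth idea can be sketched as a small fusion adapter: features taken from several encoder layers ("depth") and from several prompt-conditioned extractions ("breadth") are concatenated and projected into the LLM's embedding space. The shapes, layer choices, channel-wise concatenation, and two-layer projection below are illustrative assumptions, not Microsoft's reference code.

```python
# Illustrative sketch of a Depth-Breadth Fusion (DBFusion) style adapter.
# Tensor shapes, the number of depth/prompt views, and channel-wise concatenation
# are assumptions made for clarity; the real Florence-VL code may differ.
import torch
import torch.nn as nn

class DBFusionAdapter(nn.Module):
    def __init__(self, vis_dim: int, num_depths: int, num_prompts: int, llm_dim: int):
        super().__init__()
        fused_dim = vis_dim * (num_depths + num_prompts)
        # Two-layer MLP projecting the fused visual features into the LLM's token space.
        self.proj = nn.Sequential(
            nn.Linear(fused_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, depth_feats: list[torch.Tensor], prompt_feats: list[torch.Tensor]) -> torch.Tensor:
        # depth_feats:  features from different encoder layers, each of shape (B, N, vis_dim)
        # prompt_feats: features extracted under different task prompts, each (B, N, vis_dim)
        fused = torch.cat(depth_feats + prompt_feats, dim=-1)  # (B, N, fused_dim)
        return self.proj(fused)  # (B, N, llm_dim), prepended to the LLM's text tokens

# Toy usage: 2 depth levels + 3 prompt views, 576 visual tokens, encoder dim 1024, LLM dim 3072.
adapter = DBFusionAdapter(vis_dim=1024, num_depths=2, num_prompts=3, llm_dim=3072)
depth_feats = [torch.randn(1, 576, 1024) for _ in range(2)]
prompt_feats = [torch.randn(1, 576, 1024) for _ in range(3)]
visual_tokens = adapter(depth_feats, prompt_feats)
print(visual_tokens.shape)  # torch.Size([1, 576, 3072])
```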
The training of Florence-VL consists of two phases: end-to-end pre-training of the entire model, followed by fine-tuning of the projection layer and the LLM. Training used a carefully curated selection of open-source datasets containing both high-quality image descriptions and instruction-tuning pairs, a combination that lets the model develop both a deep understanding of images and the ability to follow complex instructions.
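A rough sketch of how such a two-stage schedule might be wired up is shown below; the stage split mirrors the description above, while the module names, learning rates, step counts, and data loaders are placeholders rather than Florence-VL's published recipe.

```python
# Sketch of a two-stage schedule: stage 1 trains the whole model end-to-end on
# image-caption data; stage 2 updates only the projection layer and the LLM on
# instruction-tuning data. All names and hyperparameters here are placeholders.
import torch

def run_stage(model, dataloader, trainable: list[str], lr: float, steps: int):
    # Freeze everything, then unfreeze only the submodules whose names are listed in `trainable`.
    for name, p in model.named_parameters():
        p.requires_grad = any(name.startswith(t) for t in trainable)
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=lr)
    for _, batch in zip(range(steps), dataloader):
        loss = model(**batch).loss  # standard next-token prediction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Stage 1: end-to-end pre-training on caption data (vision encoder, adapter, and LLM all trainable).
# run_stage(model, caption_loader, trainable=["vision_encoder", "adapter", "llm"], lr=1e-4, steps=10_000)

# Stage 2: instruction fine-tuning of only the projection adapter and the LLM.
# run_stage(model, instruction_loader, trainable=["adapter", "llm"], lr=2e-5, steps=5_000)
```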
Quantitative analyses and visualizations of Florence-VL's visual features demonstrate its advantages over commonly used vision encoders in terms of vision-language alignment, with the depth and breadth of the representation produced by DBFusion playing a crucial role. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs on a range of multimodal and vision-centric benchmarks, including general visual question answering (VQA), perception, hallucination, optical character recognition (OCR), diagram interpretation, and knowledge-intensive understanding.
To foster future research, Microsoft has open-sourced the models and the complete training recipe, allowing researchers and developers to build on these results and further advance multimodal AI. The release contributes to transparency and promotes collaboration within the research community. Florence-VL represents a significant step toward more powerful and versatile AI that understands and interprets the world in a way that more closely resembles human perception.
Bibliography
Chen, J., et al. (2024). Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion. arXiv preprint arXiv:2412.04424.
Xiao, B., et al. (2023). Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. arXiv preprint arXiv:2311.06242.
Wang, P., et al. (2024). Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution. arXiv preprint arXiv:2409.12191.
DirtyHarryLYL. (n.d.). LLM-in-Vision. GitHub repository. https://github.com/DirtyHarryLYL/LLM-in-Vision
Tripathi, S. (2024, June 25). Hands-on Guide to Vision Language Tasks using Microsoft’s Florence-2. ADaSci. https://adasci.org/hands-on-guide-on-vision-language-tasks-using-microsofts-florence-2/
Ellendorff, S., et al. (2024). Multimodal Language Models in Healthcare: A Systematic Overview and Evaluation. German Research Center for Artificial Intelligence (DFKI).
Liu, F., et al. (2024). Multimodal Large Language Models for Biomedical Literature Analysis and Clinical Question Answering. Proceedings of the 9th Clinical Natural Language Processing Workshop.
jingyi0000. (n.d.). VLM_survey. GitHub repository. https://github.com/jingyi0000/VLM_survey
Avrithis, Y. (2024). Contributions to the analysis of visual data through deep representation learning. Doctoral dissertation, École Polytechnique Fédérale de Lausanne.