ProBench: A New Benchmark for Evaluating Multimodal AI Models

Multimodal AI Models Put to the Test: ProBench Evaluates Domain-Specific Capabilities
The development of artificial intelligence is progressing rapidly, particularly in the field of multimodal models. These models, capable of processing both text and images, promise a new era of human-computer interaction and open up a wide range of applications across industries. But how does one measure the actual performance of such complex systems? A new benchmark called ProBench aims to do just that by evaluating the capabilities of multimodal foundation models on practical, domain-specific tasks.
ProBench was developed to probe the limits of current AI models and to identify areas where further research is needed. In contrast to conventional benchmarks, which are often based on standardized datasets, ProBench focuses on open-ended questions that require professional expertise and advanced reasoning. The questions come directly from experts in various disciplines and reflect their everyday challenges.
The benchmark comprises 4,000 carefully selected examples submitted by experts from ten different fields and 56 sub-fields. The spectrum spans the natural sciences, art, the humanities, programming, mathematics, and creative writing. This broad coverage enables a comprehensive evaluation of the models in different contexts and covers both theoretical knowledge and practical application.
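To make the composition concrete, the following is a minimal sketch of how a single ProBench-style record could be represented in code. The field names, example values, and file path are illustrative assumptions, not the dataset's actual schema:

```python
# Illustrative sketch of one expert-submitted task record.
# The schema and values below are assumptions for clarity only.
from dataclasses import dataclass

@dataclass
class ExpertTask:
    field: str       # one of the ten top-level fields, e.g. "Science"
    subfield: str    # one of the 56 sub-fields, e.g. "Organic Chemistry"
    question: str    # open-ended, expert-written task description
    image_path: str  # accompanying image the model must interpret

example = ExpertTask(
    field="Science",
    subfield="Organic Chemistry",
    question="Identify the reaction mechanism shown and justify each step.",
    image_path="tasks/science/ochem_0137.png",  # hypothetical path
)
```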
For the evaluation of the models, ProBench uses the "MLLM-as-a-Judge" method: a multimodal large language model (MLLM) grades the answers that other models give to the posed questions. This approach allows for automated and scalable evaluation while reducing reliance on subjective human judgment. In an initial study, 24 current models, including both open-source and proprietary solutions, were tested with ProBench.
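The following is a minimal sketch of what an MLLM-as-a-Judge comparison can look like in practice, assuming an OpenAI-compatible chat API. The judge model name, the prompt wording, and the A/B/tie output format are illustrative assumptions and do not reproduce ProBench's exact judging protocol:

```python
# Minimal MLLM-as-a-Judge sketch: send the task, its image, and two candidate
# answers to a multimodal judge model and ask for a preference.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_pair(question: str, image_path: str, answer_a: str, answer_b: str) -> str:
    """Ask a multimodal judge which of two answers better solves the task."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = (
        "You are grading two answers to an expert-level task.\n"
        f"Task: {question}\n\n"
        f"Answer A:\n{answer_a}\n\n"
        f"Answer B:\n{answer_b}\n\n"
        "Reply with 'A', 'B', or 'tie', followed by a one-sentence justification."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice of judge model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Running such a judge over every benchmark question and every pair (or set) of model answers yields the automated, scalable scoring described above.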
The results of the study show that while the best open-source models can compete with proprietary models, all tested models still struggle with complex tasks. Notable weaknesses were revealed in visual perception, text comprehension, domain expertise, and advanced reasoning. These findings provide valuable clues for future research efforts and help target the further development of multimodal AI models.
ProBench represents an important step towards a comprehensive and practical evaluation of multimodal AI models. By focusing on domain-specific tasks and incorporating expert knowledge, the benchmark helps to assess the real-world performance of these models and promotes the development of robust and reliable AI systems. For companies like Mindverse, which specialize in the development of customized AI solutions, ProBench offers a valuable resource for testing and optimizing their own models.
The challenges that ProBench highlights underscore the complexity of developing truly intelligent systems. At the same time, however, they also open up exciting possibilities for future research and development in the field of artificial intelligence. By continuously improving multimodal models, we can develop innovative solutions for complex problems in a wide variety of areas, thus shaping the future of work and society.
Bibliography:
Yang, Y., Li, D., Wu, H., Chen, B., Liu, L., Pan, L., & Li, J. (2025). ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks. *arXiv preprint arXiv:2503.06885*.
Hugging Face Papers. (n.d.). arXiv:2503.06885.
Xuchen-Li/llm-arxiv-daily. (n.d.). *GitHub* [Repository]. https://github.com/Xuchen-Li/llm-arxiv-daily
Zhao, Y. (n.d.). *ResearchGate*. https://www.researchgate.net/scientific-contributions/Yilun-Zhao-2222924951
*arXiv*. (n.d.). cs.CL listing, February 2025. http://www.arxiv.org/list/cs.CL/2025-02?skip=1750&show=250
Deng, C., Dong, X., Li, P., Liu, Y., Ma, R., Wu, F., Zhang, S., & Zhang, H. (2023). On the Advancement of Universal, Accessible, and Affordable Healthcare with Multimodal Large Language Models. *arXiv preprint arXiv:2309.10020*.
codefuse-ai/Awesome-Code-LLM. (n.d.). *GitHub* [Repository]. https://github.com/codefuse-ai/Awesome-Code-LLM
Arora, G., Arora, S., Khanuja, S., Sancheti, S., & Ajmera, J. (2024). IndicMMLU-Pro: Benchmarking the Indic Large Language Models. *arXiv preprint arXiv:2404.13808*.
Wiehe, M. (2022). *Improving Automatic Speech Recognition for Spontaneous Child Speech* [Master's thesis, Universität Hamburg]. https://www.inf.uni-hamburg.de/en/inst/ab/lt/teaching/theses/completed-theses/2022-ma-wiehe.pdf
*arXiv Daily*. (n.d.). http://arxivdaily.com/thread/64765