Tracking the Evolution of Reasoning in Multimodal AI Models

The Jumping Reasoning Curve: Tracking the Evolution of GPT-[n] and o-[n] Models

The development of large language models (LLMs) is progressing rapidly. With each new model generation, whether in OpenAI's GPT-[n] series or its newer o-[n] series, significant improvements are evident in many areas, particularly in logical reasoning. The release of the o-[n] models marked a paradigm shift towards advanced reasoning capabilities: o3, for example, surpassed human performance on the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI), a benchmark for novel problem-solving and skill acquisition.
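To make the benchmark format concrete: ARC-AGI tasks are distributed as JSON objects containing a few input/output grid pairs from which a transformation rule must be inferred and then applied to a held-out test input. The minimal Python sketch below illustrates this structure with an invented toy task (mirroring each row); it is not an actual ARC-AGI item, and real tasks are considerably harder.

```python
# Toy illustration of the ARC-AGI task format: grids are 2D lists of
# integers (colors 0-9), and each task provides a few train pairs from
# which a transformation rule must be inferred, then applied to a test
# input. This task is invented for illustration, not a real ARC-AGI item.

toy_task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[3, 0], [0, 0]]}],
}

def candidate_rule(grid):
    """Hypothesized rule: mirror each row horizontally."""
    return [list(reversed(row)) for row in grid]

# A rule is accepted only if it reproduces every train output exactly.
if all(candidate_rule(p["input"]) == p["output"] for p in toy_task["train"]):
    print("prediction:", candidate_rule(toy_task["test"][0]["input"]))
```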

However, ARC-AGI focuses primarily on symbolic patterns, whereas human reasoning often involves multimodal scenarios that combine visual and linguistic information. It is therefore important to examine the progress of LLMs on multimodal tasks as well.

To track the development of reasoning abilities in multimodal contexts, successive GPT-[n] and o-[n] models were tested on challenging multimodal puzzles (drawn from the PuzzleVQA and AlgoPuzzleVQA benchmarks) that require fine-grained visual perception combined with abstract or algorithmic reasoning. The results show a clear upward trend in reasoning ability across model generations, with particularly large performance jumps at the transition from the GPT-[n] series to the o-[n] series.
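As a rough illustration of what such an evaluation involves, the sketch below shows a minimal scoring loop over image-plus-question puzzles. The puzzle records, model names, and the query_model stub are hypothetical placeholders, not the paper's actual data or harness; a real implementation would call a vision-language API at that point.

```python
# Minimal sketch of a multimodal puzzle evaluation loop. `query_model` is
# a placeholder for whatever vision-language API is used; the puzzle
# records and model names are illustrative only.

puzzles = [
    {"image": "puzzle_001.png", "question": "Which shape completes the pattern?",
     "options": ["A", "B", "C", "D"], "answer": "C"},
    # ... more puzzles, each pairing an image with a ground-truth answer
]

def query_model(model_name: str, image_path: str, question: str, options: list[str]) -> str:
    """Placeholder: send the image and question to a multimodal model and
    return its chosen option. A real harness would call an API here."""
    return "C"  # stubbed response for illustration

def accuracy(model_name: str) -> float:
    correct = sum(
        query_model(model_name, p["image"], p["question"], p["options"]) == p["answer"]
        for p in puzzles
    )
    return correct / len(puzzles)

for model in ["gpt-4-turbo", "gpt-4o", "o1"]:
    print(f"{model}: {accuracy(model):.1%}")
```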

Efficiency and Challenges

This increased performance, however, comes at a steep computational cost. The reported figures suggest that o1, for example, incurs nearly 750 times the computational cost of GPT-4o, raising questions about efficiency.
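A back-of-the-envelope calculation shows where such cost multiples can come from: reasoning models like o1 generate long hidden chains of thought that are billed as output tokens. The numbers below are hypothetical and purely for illustration; actual prices and token counts vary by model, provider, and task.

```python
# Back-of-the-envelope cost comparison for a single puzzle query, using
# hypothetical per-token prices and token counts purely for illustration.

def cost_per_query(in_tokens, out_tokens, in_price, out_price):
    """Cost in USD, with prices given per 1M tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000

gpt = cost_per_query(in_tokens=1_000, out_tokens=200,   in_price=2.50,  out_price=10.00)
o1  = cost_per_query(in_tokens=1_000, out_tokens=8_000, in_price=15.00, out_price=60.00)

# Reasoning models bill their long hidden chain-of-thought as output
# tokens, which is what drives the large cost multiple.
print(f"per-puzzle cost ratio: {o1 / gpt:.0f}x")
```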

Despite this progress, the results also show that even advanced models still struggle with seemingly simple multimodal puzzles that require abstract reasoning, and their performance falls even further short on puzzles that demand algorithmic reasoning.

Multimodal Reasoning: An Outlook

The continued development of LLMs promises further improvements in multimodal reasoning. Future research will focus, among other things, on optimizing models for more complex multimodal tasks, and evaluation against dedicated benchmarks will play an important role in measuring that progress. In addition, the development of more resource-efficient models and algorithms remains a central concern, as it determines how broadly these models can be applied in practice. Together, these efforts will further deepen our understanding of artificial intelligence and its potential for multimodal reasoning.

Bibliography:

- Toh, V. Y. H., Chia, Y. K., Ghosal, D., & Poria, S. (2025). The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles. arXiv preprint arXiv:2502.01081.
- EMNLP 2024 program. https://2024.emnlp.org/program/accepted_main_conference/
- Bubeck, S., Chandrasekaran, V., Eldan, R., Ge, R., & Lee, J. R. (2024). PUZZLES: A Benchmark for Neural Algorithmic Reasoning. arXiv preprint arXiv:2402.06798.
- ML-Papers-of-the-Week. https://github.com/dair-ai/ML-Papers-of-the-Week
- Multimodal-AND-Large-Language-Models. https://github.com/Yangyi-Chen/Multimodal-AND-Large-Language-Models
- Wu, C. S., Tam, D., Wu, S., & Fung, P. (2024). Large Language Models are not Fair Evaluators: A Case Study on Evaluating Machine Translation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 921-935).
- Proceedings of the SIGBOVIK 2024 Conference. https://www.sigbovik.org/2024/proceedings.pdf
- The Evolution of Reasoning Models: Breaking Barriers in AI Thinking - Introduction. https://www.ve3.global/the-evolution-of-reasoning-models-breaking-barriers-in-ai-thinking-introduction/