Large Language Models Struggle with Analogical Reasoning Under Perceptual Uncertainty

Can Large Language Models Perform Analogical Reasoning under Perceptual Uncertainty?

Artificial intelligence (AI) has made rapid progress in recent years, particularly in the field of natural language understanding. Large Language Models (LLMs) can generate text, translate, and answer questions. But how good are they at analogical reasoning, a core competence of human intelligence?

A new study investigates how LLMs perform at analogical reasoning under challenging conditions, namely perceptual uncertainty. Analogical reasoning is the ability to recognize relationships between concepts or objects and transfer these relationships to new, similar situations. A classic example is the Raven's Progressive Matrices test, in which a missing figure in a matrix must be identified based on the relationships among the other figures.
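To make the task format concrete, here is a minimal Python sketch, not taken from the study, in which one attribute (the number of shapes per panel) follows a simple +1 progression across each row; the solver infers the relation from the complete rows and transfers it to the incomplete one. The panel encoding and candidate list are illustrative assumptions.

```python
# Minimal, illustrative Raven's-style example (not from the paper):
# one attribute, the number of shapes per panel, follows a +1 progression.
rows = [
    [2, 3, 4],    # row 1: complete
    [5, 6, 7],    # row 2: complete, same relation
    [1, 2, None], # row 3: the last panel is missing
]
candidates = [2, 3, 5, 7]

def infer_step(row):
    """Infer the constant difference that a complete row follows."""
    return row[1] - row[0]

step = infer_step(rows[0])
assert infer_step(rows[1]) == step   # the relation transfers across rows

prediction = rows[2][1] + step       # apply the same relation to the new row
answer = next(c for c in candidates if c == prediction)
print(answer)  # -> 3
```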

The study evaluates two leading LLMs, OpenAI's o3-mini and DeepSeek R1, using the I-RAVEN dataset and its extension I-RAVEN-X as benchmarks. I-RAVEN-X places higher demands on the models, as its rules are more complex and its attribute value ranges are larger. To simulate perceptual uncertainty, I-RAVEN-X was extended further using two strategies, sketched below: first, irrelevant confounder attributes were added that have no influence on the correct solution; second, the value distributions of the relevant attributes were smoothed to simulate perceptual blur.
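The following Python sketch illustrates, under simplified assumptions, what these two perturbations could look like for a panel represented as a dictionary of attribute values. It is not the actual I-RAVEN-X generation code; the function names, value range, neighbourhood width, and probability mass are all illustrative.

```python
import random

def add_confounders(panel, n_confounders=3, value_range=100):
    """Add attributes that carry no information about the underlying rule."""
    noisy = dict(panel)
    for i in range(n_confounders):
        noisy[f"confounder_{i}"] = random.randint(0, value_range)
    return noisy

def smooth_attribute(value, value_range=100, spread=2, mass=0.8):
    """Replace a crisp value with a distribution: `mass` on the true value,
    the remainder spread over neighbouring values, simulating perceptual blur."""
    neighbours = [v for v in range(value - spread, value + spread + 1)
                  if 0 <= v <= value_range and v != value]
    if not neighbours:
        return {value: 1.0}
    dist = {value: mass}
    for v in neighbours:
        dist[v] = (1.0 - mass) / len(neighbours)
    return dist

panel = {"size": 42, "color": 7}
print(add_confounders(panel))           # extra, irrelevant attributes
print(smooth_attribute(panel["size"]))  # {42: 0.8, neighbours share the rest}
```

Once every attribute is a distribution rather than a crisp value, a solver can no longer match exact symbols and has to reason over probabilities, which is exactly the regime in which the LLMs' accuracy collapses.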

The results show that the performance of both LLMs drops significantly under perceptual uncertainty. OpenAI's o3-mini achieved 86.6% accuracy on the original I-RAVEN dataset, which fell to 17% on the more demanding I-RAVEN-X, close to chance level. DeepSeek R1 behaved similarly: its accuracy decreased from 80.6% to 23.2%. Interestingly, o3-mini used 3.4 times more reasoning tokens on I-RAVEN-X despite the worse results.

In contrast, ARLC, a neuro-symbolic model based on probabilistic abduction that achieves state-of-the-art results on I-RAVEN, proved far more robust to the challenging conditions: its accuracy dropped only moderately, from 98.6% to 88%.
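The article does not detail ARLC's internals, but the general idea behind probabilistic abduction can be sketched: each symbolic rule is assigned a probability of explaining the row, attribute values are kept as distributions rather than crisp symbols, and the answer that best completes the row under the best-fitting rule is selected. The sketch below is a hedged illustration of this idea, not the ARLC implementation; the single "constant value" rule and the numbers are assumptions.

```python
def prob_rule_constant(dists):
    """Probability that all three panels in a row share the same value,
    given a distribution over values for each panel."""
    values = set().union(*[d.keys() for d in dists])
    return sum(
        dists[0].get(v, 0.0) * dists[1].get(v, 0.0) * dists[2].get(v, 0.0)
        for v in values
    )

# Two context panels with blurred values and two candidate answers.
row = [{4: 0.8, 5: 0.2}, {4: 0.7, 3: 0.3}]
candidates = {"A": {4: 0.9, 5: 0.1}, "B": {7: 1.0}}

scores = {name: prob_rule_constant(row + [cand])
          for name, cand in candidates.items()}
print(max(scores, key=scores.get))  # -> A
```

Because the uncertainty is handled explicitly inside the rule probabilities, blurring the inputs degrades the scores gracefully instead of breaking the reasoning, which is consistent with ARLC's comparatively small accuracy drop.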

These results demonstrate that while LLMs possess impressive language-processing capabilities, they still show significant weaknesses in analogical reasoning under perceptual uncertainty. The robustness of models like ARLC suggests that neuro-symbolic approaches are promising for building AI systems that can reason reliably even in complex and uncertain environments.

For companies like Mindverse, which specialize in the development of customized AI solutions, these findings are of great importance. The development of robust and reliable AI systems that function effectively under real-world conditions, which are often characterized by uncertainty, is a central concern. The research results underscore the need to explore and combine different AI paradigms to overcome the limitations of current technology. This opens up new possibilities for chatbots, voicebots, AI search engines, and knowledge systems that meet the requirements of complex application scenarios.

Bibliography:

Camposampiero, G., Hersche, M., Wattenhofer, R., Sebastian, A., & Rahimi, A. (2025). Can Large Reasoning Models do Analogical Reasoning under Perceptual Uncertainty? arXiv preprint arXiv:2503.11207.

Ball, L. J., & Christensen, B. T. (2009). Analogical reasoning during hypothesis generation: The effects of object and domain similarities on access and transfer. Memory & Cognition, 37, 107-118.

Shetty, K., Mu, J., & Chang, M. W. (2024). Analogical reasoning enhanced fact verification. Findings of the Association for Computational Linguistics: ACL 2024, 4608-4622.