Evaluating the Critical Thinking Abilities of Large Language Models with RealCritic

More Effective Evaluation of Critical Thinking Skills in Large Language Models

Large language models (LLMs) have made enormous progress in recent years. Beyond generating text, translating, and answering questions, their ability to critique is increasingly coming into focus. Constructive critique is essential for improving LLM performance, both through self-improvement and through feedback to other models. Evaluating this critique ability, however, is challenging because the task is open-ended.

A new benchmark, RealCritic, aims to make the evaluation of critical thinking skills in LLMs more effective. In contrast to existing benchmarks, which typically operate open-loop and judge a critique in isolation, RealCritic works closed-loop: a critique is scored by the quality of the correction generated from it. This allows the effectiveness of a critique to be measured directly rather than inferred.
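
To make the closed-loop idea concrete, here is a minimal Python sketch of such an evaluation. It is an illustration under assumptions, not the RealCritic implementation: `generate` stands in for any call to the model under test, the prompts are invented, and `is_correct` is a toy answer checker.

```python
# Minimal sketch of closed-loop critique evaluation (illustrative only).
# `generate`, the prompts, and `is_correct` are hypothetical placeholders,
# not the benchmark's actual code.
from typing import Callable


def is_correct(correction: str, reference_answer: str) -> bool:
    """Toy answer check: does the reference answer appear in the correction?"""
    return reference_answer.strip() in correction


def closed_loop_score(
    generate: Callable[[str], str],
    problem: str,
    candidate_solution: str,
    reference_answer: str,
) -> bool:
    # 1) The model critiques the candidate solution.
    critique = generate(
        f"Problem:\n{problem}\n\nCandidate solution:\n{candidate_solution}\n\n"
        "Point out any errors in this solution."
    )
    # 2) The model writes a corrected solution based on its critique.
    correction = generate(
        f"Problem:\n{problem}\n\nCandidate solution:\n{candidate_solution}\n\n"
        f"Critique:\n{critique}\n\nWrite a corrected solution."
    )
    # 3) The critique is scored only by whether the correction is right --
    #    this closes the loop from critique to outcome.
    return is_correct(correction, reference_answer)
```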

RealCritic also includes self-critique, cross-critique, and iterative critique settings, which are crucial for distinguishing advanced reasoning models from classic LLMs. Self-critique is a model's ability to analyze and improve its own output; cross-critique is the evaluation of another model's output; iterative critique lets the model refine its critique and the resulting corrections over multiple rounds of feedback.
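
As an illustration of how these settings differ mechanically, the following hedged sketch reuses the hypothetical `generate` callable from above and runs a critique-and-revise loop; whether it amounts to self- or cross-critique depends only on which model produced the initial solution. The prompts and round count are assumptions for illustration.

```python
# Hedged sketch of iterative critique, assuming the same hypothetical
# `generate` callable as above; prompts and the default round count are
# illustrative, not taken from the paper.
from typing import Callable


def iterative_critique(
    generate: Callable[[str], str],
    problem: str,
    solution: str,
    rounds: int = 3,
) -> str:
    """Critique and revise a solution for a fixed number of rounds.

    Self-critique: `solution` came from the same model behind `generate`.
    Cross-critique: `solution` came from a different model.
    """
    for _ in range(rounds):
        # The model points out mistakes in the current solution.
        critique = generate(
            f"Problem:\n{problem}\n\nCurrent solution:\n{solution}\n\n"
            "Identify any mistakes in this solution."
        )
        # The model revises the solution based on that critique.
        solution = generate(
            f"Problem:\n{problem}\n\nCurrent solution:\n{solution}\n\n"
            f"Critique:\n{critique}\n\nRevise the solution accordingly."
        )
    return solution
```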

The benchmark is built on eight challenging reasoning tasks. The results reveal clear differences between classic LLMs and advanced reasoning models. Although classic LLMs achieve comparable performance in direct chain-of-thought generation, they fall behind the advanced reasoning model o1-mini in all critique scenarios. Particularly striking: in the self-critique and iterative critique settings, classic LLMs can even perform worse than their own baseline.

These findings underscore the importance of a nuanced evaluation of critical thinking skills. RealCritic is a valuable resource for driving the development and improvement of LLMs by highlighting the strengths and weaknesses of different models' critique capabilities. The closed loop and the integration of self-, cross-, and iterative critique allow for a more comprehensive and realistic evaluation than traditional open-loop benchmarks. Judging a critique by the quality of the resulting corrections encourages the development of LLMs that not only generate convincing text but also think critically and improve their own output.

For Mindverse, a German company specializing in AI-powered content creation, these developments are highly relevant. Improving the critical thinking skills of LLMs is crucial for building high-quality AI solutions such as chatbots, voicebots, AI search engines, and knowledge systems. Precise and effective critique makes it possible to raise the quality and reliability of generated content and to continuously optimize these systems.

Bibliography:
Zhengyang Tang et al. “RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques.” arXiv preprint arXiv:2501.14492 (2025).
https://huggingface.co/papers/2501.14492
https://huggingface.co/papers
https://www.chatpaper.com/chatpaper/fr?id=3&date=1737907200&page=1
https://arxiv.org/html/2311.18702v2
https://openreview.net/forum?id=IcovaKGyMp
https://arxiv.org/abs/2410.10724
https://openreview.net/forum?id=iO4LZibEqW