Evaluating Error Detection in Long Chain-of-Thought Reasoning by Large Language Models

Artificial Intelligence on Trial: Can Large Language Models Detect Errors in Long Reasoning Processes?

The rapid development of large language models (LLMs) has enabled impressive progress in recent years in areas such as text generation, translation, and even creative writing. A particularly promising approach is so-called "Chain-of-Thought" (CoT) reasoning, in which an LLM works toward an answer through explicit intermediate steps, much as a human would. Long CoT chains in particular, which capture complex trains of thought, are increasingly the focus of research. But how reliable are these long reasoning processes, and can the models independently detect errors in their own conclusions?
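
To make the idea concrete, the following minimal Python sketch shows what a CoT-style prompt might look like; the wording and the example problem are purely illustrative and are not taken from DeltaBench.

```python
# Minimal sketch of a chain-of-thought style prompt.
# The prompt wording and the example problem are illustrative only.
QUESTION = "A train travels 120 km in 2 hours. How far does it travel in 5 hours at the same speed?"

def build_cot_prompt(question: str) -> str:
    """Wrap a question so the model is asked to reason step by step."""
    return (
        "Solve the following problem. Reason step by step and state each "
        "intermediate conclusion before giving the final answer.\n\n"
        f"Problem: {question}\nReasoning:"
    )

if __name__ == "__main__":
    print(build_cot_prompt(QUESTION))
```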

To answer these questions, so-called "Critic Models" are gaining importance. These models are trained to evaluate the quality and correctness of CoT processes and to identify potential errors. A new benchmark dataset called "DeltaBench" now enables the systematic evaluation of these Critic Models and provides insights into the strengths and weaknesses of various LLMs in handling long CoT chains.

DeltaBench: A New Benchmark for Error Detection

DeltaBench comprises a collection of long CoT processes generated by various LLMs, including so-called "o1-like" models such as QwQ and DeepSeek-R1, across task domains such as mathematics, programming, and general reasoning. The processes are annotated with the locations of errors, providing a basis for evaluating the error-detection capabilities of Critic Models.
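
To illustrate what such an annotated sample might look like, the sketch below uses a hypothetical record layout; the field names ("sections", "has_error", etc.) are assumptions for illustration and do not reproduce the actual DeltaBench schema.

```python
# Hypothetical record structure for an annotated long-CoT sample.
# Field names are illustrative and do NOT reproduce the actual DeltaBench schema.
sample = {
    "task": "math",
    "question": "Compute the sum of the first 10 positive integers.",
    "sections": [
        {"text": "The sum of 1..n is n*(n+1)/2.", "has_error": False},
        {"text": "For n = 10 this gives 10*11/2 = 60.", "has_error": True},   # arithmetic slip
        {"text": "Therefore the answer is 60.", "has_error": True},           # propagated error
    ],
}

def erroneous_sections(record: dict) -> list[int]:
    """Return the indices of annotated erroneous sections in one sample."""
    return [i for i, s in enumerate(record["sections"]) if s["has_error"]]

print(erroneous_sections(sample))  # -> [1, 2]
```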

The analysis of the CoT processes collected in DeltaBench allows for a detailed examination of the effectiveness and efficiency of various LLMs. For example, differences in the length and complexity of the generated thought processes, as well as the frequency and type of errors, can be analyzed. These findings contribute to a better understanding of the strengths and weaknesses of different model architectures and training methods.
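
As a rough illustration of this kind of analysis, the following sketch computes average reasoning length and error frequency per task domain, assuming the hypothetical record layout from the previous example.

```python
# Sketch of simple dataset statistics over records shaped like `sample` above
# (hypothetical schema): average reasoning length and error rate per task domain.
from collections import defaultdict

def summarize(records: list[dict]) -> dict[str, dict[str, float]]:
    by_task: dict[str, list[dict]] = defaultdict(list)
    for r in records:
        by_task[r["task"]].append(r)

    stats = {}
    for task, rs in by_task.items():
        n_sections = [len(r["sections"]) for r in rs]
        n_errors = [sum(s["has_error"] for s in r["sections"]) for r in rs]
        stats[task] = {
            "avg_sections": sum(n_sections) / len(rs),                # proxy for CoT length
            "avg_errors": sum(n_errors) / len(rs),                    # error frequency
            "frac_with_error": sum(e > 0 for e in n_errors) / len(rs) # share of flawed CoTs
        }
    return stats
```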

Evaluation of Critic Models and Process-Reward Models

DeltaBench serves as a basis for the comprehensive evaluation of existing Critic Models and so-called Process-Reward Models (PRMs). PRMs evaluate the quality of individual steps within a CoT process, while Critic Models assess the entire process as a whole. By comparing the performance of these models on DeltaBench, the limitations of current approaches to error detection in long CoT chains can be revealed.
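
The sketch below illustrates one simplified way such a comparison could be scored; the section-level F1 metric and the thresholding of PRM scores are assumptions for illustration, not the evaluation protocol used by DeltaBench.

```python
# Simplified evaluation sketch (not the exact DeltaBench protocol): compare
# predicted erroneous sections against annotated ones for one CoT process.
def section_f1(predicted: set[int], annotated: set[int]) -> float:
    """F1 over section indices flagged as erroneous."""
    if not predicted and not annotated:
        return 1.0
    if not predicted or not annotated:
        return 0.0
    tp = len(predicted & annotated)
    precision = tp / len(predicted)
    recall = tp / len(annotated)
    return 0.0 if tp == 0 else 2 * precision * recall / (precision + recall)

# A PRM assigns a score to every step; thresholding turns scores into flags.
prm_scores = [0.95, 0.40, 0.30]          # hypothetical step-level quality scores
prm_flags = {i for i, s in enumerate(prm_scores) if s < 0.5}

# A critic model instead emits an overall critique; here reduced to flagged indices.
critic_flags = {1}

annotated = {1, 2}
print(section_f1(prm_flags, annotated))    # PRM-style evaluation  -> 1.0
print(section_f1(critic_flags, annotated)) # critic-style evaluation -> 0.67
```

In this toy case the thresholded PRM flags both erroneous sections, while the critic flags only the first, which the metric reflects.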

The results of the evaluation provide valuable information for the further development of LLMs and Critic Models. They help developers better assess the capabilities of their models in handling complex thought processes and make targeted improvements. DeltaBench thus contributes to increasing the reliability and transparency of AI systems and strengthening trust in their capabilities.

Outlook: The Future of Error Detection in LLMs

The development of robust and reliable Critic Models is essential for the successful deployment of LLMs in critical applications. DeltaBench represents an important step in this direction and provides a solid foundation for further research. Future work could focus on the development of even more powerful Critic Models that can detect even subtle errors in complex thought processes. Furthermore, extending DeltaBench to other task domains and LLM architectures is an important aspect to ensure the generalizability of the results.

Bibliography:

He, Y., Li, S., Liu, J., Wang, W., Bu, X., Zhang, G., Peng, Z., Zhang, Z., Su, W., & Zheng, B. (2025). Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? *arXiv preprint arXiv:2502.03373*.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. *Advances in Neural Information Processing Systems*, *35*, 47391-47406.