OHRBench: A Benchmark for Evaluating the Impact of OCR Quality on Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) has established itself as a promising method for extending the capabilities of large language models (LLMs). By integrating external knowledge sources, RAG reduces hallucinations and incorporates up-to-date information without retraining the model. A central component of RAG is the external knowledge base, which is often built by extracting structured data from unstructured PDF documents using Optical Character Recognition (OCR).
However, the accuracy of OCR systems is limited, and the representation of structured data is inherently inconsistent. Therefore, knowledge bases inevitably contain various OCR errors. These errors can have a cascading effect on the performance of RAG systems, from the retrieval phase to the final generation.
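To make this cascade concrete, here is a minimal sketch of how a single OCR misread can change what a retriever returns. The bag-of-words scoring, the sample chunks, and the simulated misreads are illustrative assumptions, not part of OHRBench or any production RAG stack.

```python
# Toy retriever: ranks chunks by token overlap with the query.
# Chunks, query, and OCR misreads are invented for illustration.

def tokenize(text: str) -> set[str]:
    return {tok.lower().strip(".,?") for tok in text.split()}

def retrieve(query: str, chunks: list[str]) -> str:
    """Return the chunk with the largest token overlap with the query."""
    q = tokenize(query)
    return max(chunks, key=lambda c: len(q & tokenize(c)))

clean_chunks = [
    "Revenue grew by 12 percent in 2023.",
    "The board approved a dividend increase in March.",
]
# The same corpus after a simulated noisy OCR pass over the first chunk:
# "Revenue" -> "Revenne", "in" -> "m", "2023" -> "2O23"
noisy_chunks = [
    "Revenne grew hy 12 percent m 2O23.",
    "The board approved a dividend increase in March.",
]

query = "revenue growth in 2023"
print(retrieve(query, clean_chunks))  # the revenue chunk is retrieved
print(retrieve(query, noisy_chunks))  # the unrelated second chunk wins instead
```

With clean text the revenue chunk matches on three query tokens; after the misreads it matches on none, so an unrelated chunk is retrieved and the downstream generation step never sees the relevant evidence.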
OHRBench: A New Benchmark for OCR-based RAG Systems
To better understand the impact of OCR errors on RAG systems, OHRBench was developed – the first benchmark of its kind. OHRBench comprises 350 carefully selected, unstructured PDF documents from six real-world RAG application domains. These documents are supplemented by questions and answers derived from multimodal elements within the documents. This combination poses a challenge for existing OCR solutions and provides an ideal test environment to evaluate the influence of OCR on RAG.
To systematically investigate the influence of OCR errors, OHRBench identifies two main types of OCR noise:
- Semantic Noise: Misrecognized words that alter the meaning of the text.
- Formatting Noise: Errors in the representation of tables, lists, and other structural elements.
Through targeted manipulations of the documents, datasets with varying degrees of these two error types were generated. This allows for a detailed analysis of the impact of each individual error type on the performance of the RAG system.
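The two perturbation types can be sketched as follows. The character confusions and the table-flattening rule below are assumptions chosen to mimic typical OCR behavior; they are not the exact perturbation rules used by OHRBench.

```python
import random

# Single-character confusions mimicking common OCR misreads (assumed).
CONFUSIONS = {"l": "1", "O": "0", "e": "c", "S": "5"}

def add_semantic_noise(text: str, rate: float, seed: int = 0) -> str:
    """Semantic noise: randomly apply OCR-style character confusions."""
    rng = random.Random(seed)
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )

def add_formatting_noise(markdown_table: str) -> str:
    """Formatting noise: flatten a Markdown table into plain text."""
    lines = []
    for line in markdown_table.splitlines():
        if set(line) <= set("|-: "):
            continue  # drop separator rows such as |---|---|
        lines.append(" ".join(cell.strip() for cell in line.strip("|").split("|")))
    return "\n".join(lines)

table = "| Year | Revenue |\n|---|---|\n| 2023 | 1.2B |"
print(add_formatting_noise(table))                     # table structure is lost
print(add_semantic_noise("Overall results", rate=1.0)) # words become misreads
```

Varying `rate` (and the aggressiveness of the structural flattening) yields corpora with graded noise levels, which is the kind of controlled degradation the benchmark needs to isolate each error type's effect.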
Evaluation of OCR Solutions and RAG Systems
Using OHRBench, a comprehensive evaluation of current OCR solutions was conducted. None of the tested solutions fully meets the requirements for building high-quality knowledge bases for RAG systems. The impact of the two identified error types was then investigated systematically: both the retrieval phase and the generation phase prove vulnerable to OCR noise.
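One simple way to grade how far an OCR output has drifted from the ground-truth text is a normalized edit distance; the sketch below uses that metric as an illustrative stand-in, not as the specific measure defined by OHRBench.

```python
# Normalized Levenshtein distance as a rough "noise level" score:
# 0.0 = identical to the ground truth, 1.0 = completely different.

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def noise_level(ground_truth: str, ocr_output: str) -> float:
    if not ground_truth and not ocr_output:
        return 0.0
    return levenshtein(ground_truth, ocr_output) / max(len(ground_truth), len(ocr_output))

# Three character misreads over a 19-character string:
print(noise_level("Revenue grew by 12%", "Revenne grcw hy 12%"))
```

Scoring each perturbed corpus this way makes it possible to plot retrieval and generation quality against a quantified noise level rather than a vague "more noisy / less noisy" label.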
Vision-Language Models (VLMs) as an Alternative?
Given the challenges posed by OCR errors, the potential of Vision-Language Models (VLMs) for RAG systems is also discussed. VLMs could eliminate the need for OCR by working directly with the image data of the documents. This approach could avoid the information loss and distortions introduced by the OCR process and improve the performance of RAG systems.
Conclusion and Outlook
OHRBench provides a valuable foundation for further research on RAG. The benchmark enables a systematic investigation of how OCR quality affects RAG systems and thereby contributes to making these systems more robust and reliable. The evaluation results underscore the need for more robust OCR solutions, and for alternative approaches such as VLMs, to meet the challenges of handling unstructured documents in RAG pipelines.