FreshStack Framework Automates Creation of Realistic Benchmarks for Technical Document Retrieval Systems

New Benchmarks for Evaluating Retrieval Systems for Technical Documents

Searching for relevant information in technical documents is a central challenge in software development and many other fields. The quality of retrieval systems that perform this search is typically evaluated using benchmarks. A new framework called FreshStack now enables the automated creation of realistic benchmarks for this purpose. These benchmarks are based on real questions and answers from online communities and promise a more accurate assessment of the performance of retrieval systems.

Existing benchmarks often suffer from limitations that restrict their validity: they are frequently static, built on outdated data, and fail to reflect the complexity of real search queries. FreshStack addresses these problems with a three-stage process. First, current code repositories and technical documentation are collected automatically. Second, "nuggets" are extracted from questions and answers posted in online communities: short, self-contained statements that capture the core of each question and its solution. Finally, the collected documents are judged for relevance against these nuggets, using a combination of different retrieval techniques and hybrid architectures.
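
The three stages can be made concrete with a small sketch. The following Python code is purely illustrative: the function names, the paragraph-based chunking, and the keyword-overlap judge are assumptions standing in for the LLM-based components the framework actually relies on; only the three-stage structure mirrors the process described above.

```python
from dataclasses import dataclass, field

@dataclass
class Question:
    """A community question with its answers and the nuggets distilled from them."""
    title: str
    body: str
    answers: list = field(default_factory=list)
    nuggets: list = field(default_factory=list)

def collect_corpus(repo_docs):
    """Stage 1: gather searchable chunks from code repositories and documentation.
    Here, documents are naively split on blank lines."""
    return [chunk.strip() for doc in repo_docs
            for chunk in doc.split("\n\n") if chunk.strip()]

def extract_nuggets(question):
    """Stage 2: distill the answers into short, self-contained 'nuggets'.
    FreshStack uses an LLM for this; a sentence split stands in here."""
    question.nuggets = [s.strip() for ans in question.answers
                        for s in ans.split(".") if s.strip()]
    return question.nuggets

def supports(chunk, nugget):
    """Stage 3 helper: decide whether a chunk supports a nugget.
    The real judgment is model-based; keyword overlap is a placeholder."""
    return len(set(chunk.lower().split()) & set(nugget.lower().split())) >= 3

def label_corpus(question, corpus):
    """Stage 3: attach nugget-level relevance labels to every corpus chunk."""
    return {chunk: [n for n in question.nuggets if supports(chunk, n)]
            for chunk in corpus}
```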

Five datasets on fast-growing, recent, and niche topics have already been created with FreshStack. The results show that common retrieval models, applied to these datasets without adaptation, perform significantly worse than idealized oracle approaches, which highlights considerable room for improvement in existing retrieval systems. Notably, rerankers, which are meant to refine the results of the initial retrieval stage, did not deliver a significant gain in accuracy in two of the five cases.
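
How large that gap is can be quantified with a nugget-based coverage measure. The sketch below is not the paper's exact metric or oracle; it is a minimal, hypothetical illustration of the idea, reusing the nugget labels from the sketch above: coverage@k counts the share of a question's nuggets supported by at least one top-k result, and a greedy oracle picks documents from the pooled results of several systems to maximize that coverage.

```python
def coverage_at_k(ranked_chunks, labels, k=10):
    """Share of nuggets supported by at least one of the top-k retrieved chunks.
    `labels` maps each chunk to the nuggets it supports (see label_corpus above)."""
    all_nuggets = {n for nuggets in labels.values() for n in nuggets}
    if not all_nuggets:
        return 0.0
    covered = set()
    for chunk in ranked_chunks[:k]:
        covered.update(labels.get(chunk, []))
    return len(covered) / len(all_nuggets)

def greedy_oracle(runs, labels, k=10):
    """Idealized upper bound: from the pooled top-k results of several systems,
    greedily pick the chunks that cover the most still-missing nuggets."""
    pool = {chunk for run in runs for chunk in run[:k]}
    chosen, covered = [], set()
    while pool and len(chosen) < k:
        best = max(pool, key=lambda c: len(set(labels.get(c, [])) - covered))
        chosen.append(best)
        covered.update(labels.get(best, []))
        pool.remove(best)
    return chosen
```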

The developers of FreshStack hope that the framework will advance the development of more realistic, scalable, and uncontaminated benchmarks for information retrieval and retrieval-augmented generation (RAG). The availability of such benchmarks is crucial for the further development of AI-powered search systems and for better information access in technical contexts. The datasets created with FreshStack are publicly available, giving researchers and developers the opportunity to test and optimize their retrieval systems under realistic conditions.

The automated creation of benchmarks with FreshStack offers several advantages. The process is reproducible and can easily be adapted to new subject areas. Because the benchmarks are built from real questions and answers in online communities, they reflect the actual needs of users. And the combination of different retrieval techniques and architectures enables a comprehensive evaluation of system performance.
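
One standard way to combine different retrieval techniques, mentioned above as a strength of the evaluation setup, is reciprocal rank fusion. The snippet below is a generic sketch of that technique, not FreshStack's specific fusion method; the example runs are made up.

```python
def reciprocal_rank_fusion(runs, k=60, top_n=10):
    """Fuse several ranked result lists (e.g. BM25 and a dense retriever)
    by summing 1 / (k + rank) for every list a document appears in."""
    scores = {}
    for run in runs:
        for rank, doc_id in enumerate(run, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical runs: a lexical and a dense ranking over the same corpus.
bm25_run  = ["doc3", "doc1", "doc7", "doc2"]
dense_run = ["doc1", "doc5", "doc3", "doc9"]
print(reciprocal_rank_fusion([bm25_run, dense_run]))  # documents found by both rise to the top
```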

The results of the first tests with FreshStack underscore the importance of realistic benchmarks for the development of retrieval systems. The large performance gap between common models and the oracle approaches shows that there is still considerable need for research. The findings on rerankers suggest that optimizing the initial retrieval stage may matter more than refining its results afterwards.
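
The reranking step in question typically looks like the sketch below: a cross-encoder rescores the candidates returned by the first stage. The code assumes the sentence-transformers library and a public MS MARCO cross-encoder checkpoint; neither is prescribed by FreshStack, and if the relevant documents never reach the candidate list, no reranker can recover them.

```python
from sentence_transformers import CrossEncoder

def rerank(query, candidates, model_name="cross-encoder/ms-marco-MiniLM-L-6-v2", top_n=10):
    """Reorder first-stage candidates by cross-encoder relevance score."""
    model = CrossEncoder(model_name)
    scores = model.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```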

Conclusion

FreshStack represents an important step towards more realistic and meaningful benchmarks for evaluating retrieval systems. The framework enables the automated creation of datasets that are grounded in real questions and answers and that reflect the complexity of technical documents. The initial results show that existing retrieval models still have significant room for improvement. FreshStack is likely to shape research and development in information retrieval and RAG and to lead to more powerful search systems.
