CLIMB: A New Method for Optimizing Large Language Model Training Data

Selecting the right training data is crucial for the performance of large language models. Existing approaches, such as indiscriminately compiling data from the internet (e.g., Common Crawl) or manually curating datasets (e.g., The Pile), each have drawbacks: the former often yields an unbalanced dataset, while the latter is time-consuming and expensive. A new method called CLIMB (CLustering-based Iterative Data Mixture Bootstrapping) promises a remedy.

CLIMB, developed by a research team with participation from NVIDIA, offers an automated approach to identifying, evaluating, and refining data mixtures for training large language models. The procedure embeds large-scale datasets in a semantic space and clusters them, then iteratively searches for an optimal data mixture using a smaller proxy model and a performance predictor.
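
As a rough illustration of the embedding-and-clustering step, the sketch below embeds a handful of example documents with an off-the-shelf sentence-embedding model and groups them with k-means. The model name, the example texts, and the cluster count are illustrative assumptions, not the embedding model or cluster configuration used in the CLIMB paper.

```python
# Sketch of the embedding-and-clustering step: map documents into a semantic
# vector space and group similar ones. Model choice and cluster count are
# placeholder assumptions, not the paper's configuration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = [
    "Photosynthesis converts light energy into chemical energy.",
    "The stock market rallied after the interest rate announcement.",
    "Quantum entanglement links the states of distant particles.",
    "Central banks adjust rates to control inflation.",
]

# Embed each document into a semantic vector space.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(documents)

# Group semantically similar documents; each cluster approximates a topic or domain.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)

for doc, cid in zip(documents, cluster_ids):
    print(f"cluster {cid}: {doc[:60]}")
```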

The functionality of CLIMB can be divided into three main phases (a simplified sketch of the search loop follows the list):

  • Embedding and Clustering: The input data is embedded into a semantic vector space, allowing similar data points to be grouped. These clusters represent different topics or domains.
  • Iterative Search: A smaller proxy model and a predictor are used to search iteratively for the best mixture of clusters. The predictor estimates how the proxy model would perform when trained on the current data mixture.
  • Refinement: The data mixture is gradually refined by adjusting the weighting of the individual clusters. The goal is to maximize the performance of the proxy model.
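
The following sketch shows how such a search loop could look in simplified form. It is an assumption-laden outline rather than the paper's algorithm: proxy-model training is stubbed out with a fictitious scoring function, a random-forest regressor stands in for the performance predictor, and the Dirichlet sampling and refinement rule are illustrative choices.

```python
# Illustrative sketch of the iterative mixture search: sample candidate cluster
# weightings, evaluate a few with a (stubbed) proxy model, fit a predictor on the
# observed results, and bias further sampling toward the best mixture found.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_clusters = 20          # e.g. the 20 clusters of ClimbLab
n_iterations = 3
candidates_per_iter = 64
proxy_evals_per_iter = 8

def sample_mixtures(n, concentration):
    # Dirichlet samples are valid mixture weights (non-negative, sum to 1).
    return rng.dirichlet(concentration, size=n)

def train_and_eval_proxy(weights):
    # Placeholder: in CLIMB this would train a small proxy model on data drawn
    # according to `weights` and return its downstream benchmark score.
    target = np.linspace(0.2, 1.0, n_clusters)      # fictitious "good" mixture
    return float(-np.sum((weights - target / target.sum()) ** 2))

concentration = np.ones(n_clusters)                  # start from a uniform prior
observed_w, observed_s = [], []

for it in range(n_iterations):
    candidates = sample_mixtures(candidates_per_iter, concentration)

    if observed_w:
        # The predictor estimates proxy performance from the mixture weights alone,
        # so most candidates never require an expensive proxy-training run.
        predictor = RandomForestRegressor(random_state=0)
        predictor.fit(np.array(observed_w), np.array(observed_s))
        candidates = candidates[np.argsort(-predictor.predict(candidates))]

    # Only the top-ranked candidates are actually evaluated with the proxy model.
    for w in candidates[:proxy_evals_per_iter]:
        observed_w.append(w)
        observed_s.append(train_and_eval_proxy(w))

    # Refinement: bias the next round of sampling toward the best mixture so far.
    best = np.array(observed_w)[int(np.argmax(observed_s))]
    concentration = 1.0 + best * n_clusters * (it + 1)

print("best mixture weights:", np.round(best, 3))
```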

Initial results demonstrate the potential of CLIMB. A 1-billion-parameter model trained on 400 billion tokens selected with this method surpassed the state-of-the-art model Llama-3.2-1B by 2.0%. Furthermore, optimizing the mixture for a specific domain (e.g., the social sciences) yielded a performance gain of 5% over random data selection.

As part of the research, two new datasets were also published: ClimbLab, a filtered corpus with 1.2 trillion tokens and 20 clusters, which serves as a research platform, and ClimbMix, a compact but powerful dataset with 400 billion tokens, designed for efficient pre-training and offering superior performance with the same token budget.

The development of CLIMB addresses the challenges in compiling optimal training data for large language models. The automated approach promises more efficient and effective use of data resources and could lead to further improvements in the performance of AI systems. In particular, the ability to tailor the data mixture to specific domains opens up new possibilities for specialized AI applications.

For Mindverse, a German company specializing in the development of AI solutions, these developments are of particular interest. The optimization of training data plays a central role in the development of customized AI solutions such as chatbots, voicebots, AI search engines, and knowledge systems. CLIMB and the associated datasets could contribute to further increasing the performance and efficiency of these solutions.

Bibliography:

Diao, S., et al. (2025). CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training. arXiv preprint arXiv:2504.13161.

Rogers, A., et al. (2024). What Will it Take to Fix Benchmarking in Natural Language Understanding? ResearchGate.

Farrelly, C. (2010). Masters Major Paper 5-DF. SlideShare.

Various Authors. Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Various Authors. Proceedings of the 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2025).

Various Authors. Proceedings of the 2025 International Conference on Robotics and Automation (ICRA 2025).

Various Authors. Transactions on Machine Learning Research (TMLR).

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press. Via Publikationen der SULB, Universität des Saarlandes.
