Multilingual Continual Pre-training Strategies for Large Language Models

Multilingualism in LLMs: Rethinking Continual Pre-Training
Large language models (LLMs) have revolutionized natural language processing, but their performance varies considerably across languages. While high-resource languages such as English benefit greatly from LLMs, lower-resource languages often lag behind. Continual pre-training (CPT) has emerged as a promising approach to narrowing these gaps. However, the optimal CPT strategy in a multilingual setting, particularly the choice between monolingual, bilingual, and code-augmented training data, is not yet well understood.
A new study systematically investigates the effectiveness of different CPT configurations for multilingual LLMs. The researchers evaluated 36 configurations built on three multilingual base models and covering more than 30 languages. These languages were grouped into three categories, altruistic, selfish, and stagnant, spanning the full range of resource availability. The results provide important guidance for the further development of multilingual LLMs.
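To make such an experimental grid more concrete, the following Python sketch enumerates CPT configurations as combinations of base model, data strategy, and language group. The model names, strategy labels, and the resulting count are illustrative assumptions and do not reproduce the study's exact 36-configuration setup.

```python
from itertools import product

# Illustrative placeholders, not the study's actual model list or grid.
BASE_MODELS = ["multilingual-base-A", "multilingual-base-B", "multilingual-base-C"]
DATA_STRATEGIES = ["monolingual", "bilingual", "code-augmented"]
LANGUAGE_GROUPS = ["altruistic", "selfish", "stagnant"]

def enumerate_cpt_configs(models, strategies, groups):
    """Yield one CPT configuration per (model, data strategy, language group) combination."""
    for model, strategy, group in product(models, strategies, groups):
        yield {
            "base_model": model,
            "data_strategy": strategy,
            "language_group": group,
        }

if __name__ == "__main__":
    configs = list(enumerate_cpt_configs(BASE_MODELS, DATA_STRATEGIES, LANGUAGE_GROUPS))
    # 3 x 3 x 3 = 27 combinations in this toy grid, not the paper's 36.
    print(f"{len(configs)} illustrative configurations")
    for cfg in configs[:3]:
        print(cfg)
```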
Bilingual CPT: Improved Classification, but Risk of Language Mixing
The study shows that bilingual CPT improves accuracy on multilingual classification tasks. However, it also carries a risk of language mixing, especially in generative tasks: the model may unintentionally switch between languages during text generation, which degrades the quality of its output.
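As a rough illustration of what bilingual CPT data can look like, the sketch below mixes monolingual target-language documents with parallel sentence pairs rendered as single training sequences. The formatting template, the language codes, and the mixing ratio are assumptions made for illustration, not the data construction used in the study.

```python
import random

def format_parallel_pair(src_text, tgt_text, src_lang, tgt_lang):
    """Render a parallel pair as one training sequence (template is an assumption)."""
    return f"{src_lang}: {src_text}\n{tgt_lang}: {tgt_text}"

def build_bilingual_corpus(mono_docs, parallel_pairs, parallel_ratio=0.3, seed=0):
    """Mix monolingual documents with formatted parallel pairs at roughly the given ratio."""
    rng = random.Random(seed)
    n_parallel = int(len(mono_docs) * parallel_ratio / (1 - parallel_ratio))
    sampled_pairs = [
        format_parallel_pair(src, tgt, "eng", "fin")
        for src, tgt in rng.sample(parallel_pairs, min(n_parallel, len(parallel_pairs)))
    ]
    corpus = mono_docs + sampled_pairs
    rng.shuffle(corpus)
    return corpus

# Toy usage with placeholder data.
mono = ["Tämä on suomenkielinen dokumentti."] * 7
pairs = [("This is a sentence.", "Tämä on lause.")] * 10
print(len(build_bilingual_corpus(mono, pairs)))
```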
Code-Augmented CPT: Gains for Low-Resource Languages, but Losses in Generation
Integrating programming code into the CPT data improves classification accuracy for many languages, especially low-resource ones. This gain comes at the cost of a slight decline in generation quality. The study thus highlights a trade-off between classification and generation performance when using code-augmented CPT.
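The sketch below illustrates one simple way of adding a code component to a CPT mixture: each training document is drawn from one of several sources according to fixed weights. The 80/10/10 split between target-language text, English text, and code is an illustrative assumption, not the ratio used in the study.

```python
import random

def sample_cpt_batch(sources, weights, batch_size=8, seed=0):
    """Sample a training batch by drawing each document from a source chosen by weight."""
    rng = random.Random(seed)
    names = list(sources)
    chosen_sources = rng.choices(names, weights=weights, k=batch_size)
    return [rng.choice(sources[name]) for name in chosen_sources]

# Placeholder corpora; the weights below are an illustrative assumption.
sources = {
    "target_text": ["target-language document"] * 5,
    "english_text": ["English document"] * 5,
    "code": ["def add(a, b):\n    return a + b"] * 5,
}
batch = sample_cpt_batch(sources, weights=[0.8, 0.1, 0.1], batch_size=8)
print(batch)
```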
Altruism, Selfishness, and Stagnation: A Differentiated Picture of Cross-Lingual Transfer
Contrary to previous assumptions, the study shows that categorizing languages as altruistic, selfish, or stagnant with respect to cross-lingual transfer is not always reliable. Altruistic languages, which in theory should promote transfer to related languages, can under certain circumstances even hinder it. Selfish languages, expected to benefit mainly themselves, behave in a way that depends strongly on the specific CPT configuration. And stagnant languages, assumed to neither give nor receive much benefit, show unexpected adaptability under certain conditions.
These results underscore the complexity of multilingual representation learning. They highlight the need for systematic studies of language classification and its influence on cross-lingual transfer in order to optimize future CPT strategies and improve LLM performance across languages. The findings are particularly relevant for companies like Mindverse, which specialize in AI-powered language solutions, because they can help make chatbots, voicebots, AI search engines, and knowledge systems more accessible and effective in multilingual settings.
Bibliography:
Li, Z., Ji, S., Luo, H., & Tiedemann, J. (2025). Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources. arXiv preprint arXiv:2504.04152.
Wang, H., et al. (2024). A Survey on Continual Learning for Large Language Models. arXiv preprint arXiv:2410.14815.
Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2024).
Wang-ML-Lab/llm-continual-learning-survey. GitHub repository.
Lazaridou, A., et al. (2024). Towards Continual Knowledge Learning of Language Models. Data Intelligence, 6(3), 311-331.
Febriyanti, I. A., et al. (2025). Continual Learning for Low-Resource Neural Machine Translation. Proceedings of the 2025 International Conference on Indonesian Language Processing (IndoNLP).
Pan, S. J., & Yang, Q. (2010). A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359.
Proceedings of the 2025 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2025).
Tiedemann, J., & Thuy, N. V. (2024). Enhancing Translation Accuracy of Large Language Models through Continual Pre-Training on Parallel Data. ResearchGate preprint.
Faruqui, M., et al. (2024). Cross-lingual Language Model Pretraining. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP).