Fietje: An Efficient Open-Source Language Model for Dutch

The development of large language models (LLMs) is progressing rapidly. While large models achieve impressive results, their size often makes deployment on resource-constrained hardware difficult. Fietje is a new, efficient LLM designed specifically for Dutch to address this gap.
Foundation and Variants
Fietje is based on Phi-2, Microsoft's English-language model with 2.7 billion parameters. Through continued pre-training on 28 billion Dutch tokens, Fietje achieves performance comparable to significantly larger models such as GEITje 7B Ultra. Fietje comes in three variants:
Fietje-2b: The base model for general text generation; it forms the foundation for the other two variants.
Fietje-2b-instruct: This model is specialized for following instructions. It can answer questions, conduct conversations, and execute instructions.
Fietje-2b-chat: The chat variant has been further refined through preference optimization and is particularly well-suited as a chat assistant.
Performance Comparison
Although Fietje is significantly smaller than comparable Dutch LLMs, it achieves remarkable results in benchmarks. Tests with ScandEval (v12.6.1) show that Fietje even surpasses larger models in some areas; the results for squad-nl and mmlu-nl are particularly noteworthy. However, benchmarks for generative models are of limited informative value: actual performance depends strongly on the specific task and on how prompts are phrased.
Detailed results, including confidence intervals and further metrics, are available on the ScandEval Leaderboard. The raw data, including the results of other models, can be found under evaluation/scandeval_benchmark_results.jsonl in the associated GitHub repository.
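As a rough sketch of how such raw JSONL results can be inspected, the snippet below parses line-delimited JSON records and picks the best-scoring model per benchmark. The records and scores here are invented for illustration only, and the real file's field names may differ:

```python
import json

# Invented records mimicking the shape of a benchmark-results JSONL file;
# the real evaluation/scandeval_benchmark_results.jsonl may use other fields.
sample_records = [
    {"model": "fietje-2b", "dataset": "squad-nl", "score": 61.2},
    {"model": "baseline-7b", "dataset": "squad-nl", "score": 55.4},
    {"model": "fietje-2b", "dataset": "mmlu-nl", "score": 34.0},
]
jsonl_text = "\n".join(json.dumps(r) for r in sample_records)  # one JSON object per line

def best_per_dataset(records: list[dict]) -> dict:
    """Return, for each dataset, the record with the highest score."""
    best: dict = {}
    for rec in records:
        key = rec["dataset"]
        if key not in best or rec["score"] > best[key]["score"]:
            best[key] = rec
    return best

records = [json.loads(line) for line in jsonl_text.splitlines()]
print(best_per_dataset(records)["squad-nl"]["model"])  # -> fietje-2b
```

The same loop works unchanged on the real file once its records are read line by line.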
Training Data and Process
The training of Fietje took place in three phases: continued pre-training with Dutch texts, supervised fine-tuning with an instruction dataset, and preference optimization.
For the pre-training of the base model, 28 billion Dutch tokens were used. These are largely derived from CulturaX and the Dutch Wikipedia. A detailed description of the dataset, including the filtering methods applied for quality assurance, is publicly available.
Three datasets were used for supervised fine-tuning, two of them synthetically generated; together they comprise 201,579 examples.
The preference optimization of the chat model was carried out with cleaned and evaluated datasets, comprising a total of 18,653 examples.
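To make the data format concrete, here is a hypothetical DPO-style preference record, together with a small validity check. The field names follow the convention used by common preference-optimization tooling; the actual Fietje datasets may use different names and content:

```python
# A hypothetical preference record: each example pairs a prompt with a
# preferred ("chosen") and a dispreferred ("rejected") answer. The model
# is trained to favor the chosen response over the rejected one.
example = {
    "prompt": "Wat is de hoofdstad van Nederland?",
    "chosen": "De hoofdstad van Nederland is Amsterdam.",
    "rejected": "Nederland heeft geen hoofdstad.",
}

def is_valid_preference_record(record: dict) -> bool:
    """Check that a record has the three non-empty string fields
    preference optimization needs."""
    return all(
        isinstance(record.get(key), str) and record[key].strip()
        for key in ("prompt", "chosen", "rejected")
    )

print(is_valid_preference_record(example))  # True
```

Cleaning a preference dataset largely amounts to dropping or repairing records that fail checks like this one.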
Usage Possibilities
Fietje can be used in various ways. The simplest option is the Hugging Face web interface. Locally, it can be run via LM Studio, Ollama, or Python. For beginners, LM Studio is the recommended starting point.
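As a quick illustration of the Python route, here is a minimal sketch using Hugging Face's transformers text-generation pipeline, assuming the chat variant is published under the BramVanroy/fietje-2b-chat repository. The model weights are downloaded on first use, so this requires a few gigabytes of disk space:

```python
def build_messages(question: str) -> list[dict]:
    """Wrap a user question in the messages format used by chat templates."""
    return [{"role": "user", "content": question}]

def ask_fietje(question: str, max_new_tokens: int = 128) -> str:
    """Generate an answer with the chat variant via transformers.

    The import is done lazily so the helper above stays usable without
    the (large) optional dependencies installed.
    """
    from transformers import pipeline  # needs `pip install transformers torch`

    generator = pipeline("text-generation", model="BramVanroy/fietje-2b-chat")
    output = generator(build_messages(question), max_new_tokens=max_new_tokens)
    # With a messages-style input, the pipeline returns the full
    # conversation; the last message is the model's reply.
    return output[0]["generated_text"][-1]["content"]

if __name__ == "__main__":
    print(ask_fietje("Wat is de hoofdstad van Nederland?"))
```

The same messages list also works with the instruct variant; only the model id changes.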
Outlook
Fietje represents an important step in the development of Dutch language models. The open availability of model weights, datasets, training code, and evaluation data promotes transparency and reproducibility and allows the community to build upon this work. Future developments promise further improvements and broader application possibilities for Dutch LLMs.