Input Vocabulary Scaling Improves Transformer Model Performance

The Importance of Vocabulary for Scaling Transformer Models

Tokenization, the process of breaking text into individual units (tokens), is a fundamental component of large language models (LLMs). However, the influence of tokenization on the scaling behavior and performance of these models has not yet been fully explored. A recent research approach, the "Over-Tokenized Transformer," sheds light on this relationship and shows that vocabulary size has a significant impact on model performance.

Traditionally, LLMs use the same vocabulary for input and output. The Over-Tokenized Transformer decouples these two vocabularies. This allows a significantly larger vocabulary to be used on the input side, one that can include multi-gram tokens, i.e., sequences of several adjacent base tokens treated as a single unit. These coarser-grained tokens let the model capture more context per input position and thereby improve performance.
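To make the idea concrete, the sketch below shows one way such an over-encoded input embedding could look in PyTorch: each position sums the embeddings of its 1-, 2-, and 3-gram token IDs, with the higher-order n-grams hashed into a single large fixed-size table. This is an illustrative simplification, not the paper's exact implementation; the class name, the shared hashed table, and all sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class NGramInputEmbedding(nn.Module):
    """Illustrative over-encoded input embedding: each position sums the
    embeddings of the 1-, 2-, and 3-gram token IDs ending at that position.
    Higher-order n-grams are hashed into one large fixed-size table so the
    combinatorially large multi-gram vocabulary stays manageable."""

    def __init__(self, base_vocab_size, ngram_table_size, d_model, max_n=3):
        super().__init__()
        self.base_vocab_size = base_vocab_size
        self.ngram_table_size = ngram_table_size
        self.max_n = max_n
        self.unigram = nn.Embedding(base_vocab_size, d_model)
        # One shared hashed table for all n >= 2 (an illustrative choice).
        self.ngram = nn.Embedding(ngram_table_size, d_model)

    def forward(self, token_ids):  # token_ids: (batch, seq_len), int64
        emb = self.unigram(token_ids)
        for n in range(2, self.max_n + 1):
            # Build an ID for the n-gram ending at each position by mixing
            # the previous n token IDs (a simple positional rolling hash).
            ngram_id = torch.zeros_like(token_ids)
            for k in range(n):
                shifted = torch.roll(token_ids, shifts=k, dims=1)
                shifted[:, :k] = 0  # positions without a full n-gram context
                ngram_id = ngram_id * self.base_vocab_size + shifted
            emb = emb + self.ngram(ngram_id % self.ngram_table_size)
        return emb

# Example: 8k base vocabulary, 1M hashed n-gram slots, 256-dim embeddings.
layer = NGramInputEmbedding(base_vocab_size=8_000, ngram_table_size=1_000_000, d_model=256)
tokens = torch.randint(0, 8_000, (2, 16))
print(layer(tokens).shape)  # torch.Size([2, 16, 256])
```

The appeal of this kind of design is that the extra n-gram lookups add parameters to the embedding table but almost no compute to the forward pass, since the backbone still processes exactly one vector per position.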

Extensive experiments revealed a log-linear relationship between the size of the input vocabulary and the training loss: regardless of model size, larger input vocabularies consistently lead to lower loss and thus better performance. In the reported experiments, a model with a greatly enlarged input vocabulary matched the performance of a baseline roughly twice its size, at negligible additional cost.
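The functional form of this relationship can be illustrated in a few lines of NumPy. The numbers below are synthetic and only demonstrate what "log-linear" means here; they are not figures from the paper.

```python
import numpy as np

# "Log-linear" means the training loss drops by a roughly constant amount
# each time the input vocabulary size is multiplied by a fixed factor.
# Synthetic data illustrating loss ~ a - b * log(V_in):
vocab_sizes = np.array([32_000, 128_000, 512_000, 2_048_000, 8_192_000])
a, b = 3.0, 0.02                      # illustrative coefficients only
loss = a - b * np.log(vocab_sizes)

slope, _ = np.polyfit(np.log(vocab_sizes), loss, 1)
print(slope)  # ~ -0.02: each e-fold increase in V_in lowers the loss by about b
```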

These results underscore the importance of tokenization for scaling language models. They challenge the common practice of keeping the vocabulary size fixed and focusing on model size alone, and they suggest that expanding the input vocabulary is an efficient strategy for improving performance, opening up new possibilities for tokenizer design.

Implications for the Development of Language Models

The research findings on the Over-Tokenized Transformer have far-reaching implications for the development and application of LLMs. Expanding the input vocabulary in a targeted way makes it possible to train models that are both more efficient and more capable. This is particularly relevant for applications that require a deep understanding of context and linguistic nuance, such as machine translation, text generation, and question-answering systems.

The decoupling of input and output vocabularies allows for more flexible design of language models. While the input vocabulary can be expanded to better capture context, the output vocabulary can be tailored to the specific requirements of the respective application. This opens up new possibilities for the development of specialized language models optimized for specific domains or tasks.
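A minimal sketch of such a decoupled design is shown below: the input embedding table and the output head are sized independently. The class name, the vocabulary sizes, and the use of a plain (non-causal) encoder stack as the backbone are simplifications chosen for brevity, not details from the paper.

```python
import torch
import torch.nn as nn

class DecoupledVocabLM(nn.Module):
    """Toy model whose input embedding table and output head use different
    vocabulary sizes. A small (non-causal) Transformer encoder stands in
    for the backbone purely to keep the example short."""

    def __init__(self, input_vocab, output_vocab, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.input_embedding = nn.Embedding(input_vocab, d_model)   # large input vocabulary
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.output_head = nn.Linear(d_model, output_vocab)         # smaller output vocabulary

    def forward(self, input_ids):
        hidden = self.backbone(self.input_embedding(input_ids))
        return self.output_head(hidden)  # logits over the output vocabulary

# Example: a 200k-entry input vocabulary paired with a 32k-entry output vocabulary.
model = DecoupledVocabLM(input_vocab=200_000, output_vocab=32_000)
logits = model(torch.randint(0, 200_000, (1, 8)))
print(logits.shape)  # torch.Size([1, 8, 32000])
```

Because the two vocabularies are independent, the input table can grow to capture multi-gram context while the output distribution stays compact and cheap to compute.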

The research on the Over-Tokenized Transformer highlights the need to consider tokenization as an integral part of the model architecture. The choice of tokenizer and the size of the vocabulary have a significant impact on the model's performance and should therefore be carefully considered. Future research in this area could focus on the development of adaptive tokenizers that dynamically adjust to the specific task and context.

Conclusion

The Over-Tokenized Transformer offers a promising approach to improving the performance of language models. Scaling the input vocabulary proves to be an effective strategy for increasing model performance without having to increase the model size. These findings open up new perspectives for the development of more efficient and powerful language models and underscore the importance of tokenization as a key component in the architecture of LLMs.

Bibliography:
Huang, H., et al. "A General and Efficient Training for Transformer via Token Expansion." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
Huang, H., et al. "Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling." arXiv preprint arXiv:2501.16975, 2025.
Pagnoni, A., et al. "Byte Latent Transformer: Patches Scale Better Than Tokens." arXiv preprint arXiv:2412.09871, 2024.
Lan, Z., et al. "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." arXiv preprint arXiv:1909.11942, 2019.
Liu, Y., et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach." arXiv preprint arXiv:1907.11692, 2019.
Vaswani, A., et al. "Attention Is All You Need." Advances in Neural Information Processing Systems, 2017.
"Transformers tokenizers and the in-domain problem." Weights & Biases.
"University of Amsterdam at TREC Deep Learning 2020." Proceedings of The Twenty-Ninth Text REtrieval Conference (TREC 2020).
"Byte Latent Transformer: Patches Scale Better Than Tokens." ResearchGate.
"Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling." Hacker News.