SmolDocling: A Compact Vision-Language Model for Multimodal Document Conversion

Compact AI for Multimodal Document Conversion: SmolDocling
Converting documents such as scans and PDFs into structured, machine-readable formats remains a challenge. Extracting information from complex layouts with tables, images, formulas, and diverse formatting requires specialized solutions. A promising approach is the application of Vision-Language Models (VLMs), which combine image and text processing to extract and interpret document content. A new model in this field is SmolDocling, an ultra-compact VLM designed for end-to-end document conversion.
Efficient Document Processing with DocTags
SmolDocling is characterized by its compact size and the use of a novel universal markup format called DocTags. DocTags enables the capture of all page elements within the context of their position and content. In contrast to conventional approaches, which often rely on large, resource-intensive models or complex pipelines of specialized models, SmolDocling offers an integrated solution. With only 256 million parameters, it allows for the precise capture of content, structure, and spatial information of document elements.
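To make the idea concrete, the sketch below parses a small DocTags-style string into typed elements with bounding boxes. This is an illustrative assumption, not the authoritative format: the tag names, the four `<loc_*>` position tokens per element, and the sample content are modeled loosely on the format described in the paper, and the `parse_doctags` helper is hypothetical.

```python
import re

# Illustrative DocTags-style snippet (tag names and <loc_*> position tokens
# are assumptions based on the format described in the SmolDocling paper).
doctags = (
    "<doctag>"
    "<section_header><loc_12><loc_10><loc_488><loc_30>1 Introduction</section_header>"
    "<text><loc_12><loc_40><loc_488><loc_120>Document conversion remains hard.</text>"
    "</doctag>"
)

# Each element carries its type, an optional bounding box encoded as four
# <loc_*> tokens (x0, y0, x1, y1), and its textual content.
ELEMENT = re.compile(
    r"<(?P<tag>\w+)>"
    r"(?:<loc_(?P<x0>\d+)><loc_(?P<y0>\d+)><loc_(?P<x1>\d+)><loc_(?P<y1>\d+)>)?"
    r"(?P<content>[^<]*)"
    r"</(?P=tag)>"
)

def parse_doctags(s: str) -> list[dict]:
    """Extract (type, bbox, text) records from a DocTags-style string."""
    elements = []
    for m in ELEMENT.finditer(s):
        bbox = None
        if m.group("x0") is not None:
            bbox = tuple(int(m.group(k)) for k in ("x0", "y0", "x1", "y1"))
        elements.append({"type": m.group("tag"), "bbox": bbox,
                         "text": m.group("content")})
    return elements

for el in parse_doctags(doctags):
    print(el["type"], el["bbox"], el["text"])
```

The point of the exercise: a single tag stream carries element type, position, and content together, which is what lets one compact model emit the full page structure in one pass.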
Versatile Applications and Robust Performance
The applications of SmolDocling span a wide range of document types, including business documents, scientific papers, technical reports, patents, and forms. The model demonstrates robust performance in reproducing document features such as code listings, tables, equations, diagrams, and lists. This versatility goes beyond the focus on scientific publications observed in many other VLMs.
New Datasets for Improved Training
As part of the development of SmolDocling, new, publicly available datasets for diagrams, tables, equations, and code recognition were created. These datasets contribute to improving the model's performance and optimizing the recognition of complex layouts.
Compact Size, High Performance
Experimental results show that SmolDocling can compete in performance with significantly larger VLMs, which have up to 27 times more parameters. At the same time, the model's compact size significantly reduces computational requirements, making it attractive for use in resource-constrained environments.
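A back-of-the-envelope calculation illustrates what the size difference means for deployment. The sketch below estimates weight memory only, assuming bfloat16 storage (2 bytes per parameter) and ignoring activations and KV cache; the 27x figure is taken from the comparison above.

```python
# Back-of-the-envelope weight-memory estimate, assuming bfloat16 storage
# (2 bytes per parameter); activations and KV cache are ignored.
BYTES_PER_PARAM = 2

def weight_memory_gib(params: int) -> float:
    """Approximate weight memory in GiB for a given parameter count."""
    return params * BYTES_PER_PARAM / 1024**3

smoldocling = 256_000_000            # 256M parameters (from the paper)
larger_vlm = 27 * smoldocling        # "up to 27 times more parameters"

print(f"SmolDocling:    {weight_memory_gib(smoldocling):.2f} GiB")
print(f"27x larger VLM: {weight_memory_gib(larger_vlm):.2f} GiB")
```

Under these assumptions SmolDocling's weights fit in roughly half a gibibyte, versus over a dozen for a 27x larger model, which is why it is attractive for resource-constrained environments.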
Future Prospects and Availability
SmolDocling is currently available, and the associated datasets are to be released soon. The model promises an efficient and precise solution for multimodal document conversion and could drive the automation of document processes in various fields.
Bibliography:
- Nassar, A., et al. "SmolDocling: An Ultra-Compact Vision-Language Model for End-to-End Multi-Modal Document Conversion." arXiv preprint arXiv:2503.11576 (2025).
- https://arxiv.org/abs/2503.11576
- https://arxiv.org/html/2503.11576v1
- https://neurips.cc/virtual/2024/poster/93655
- https://aclanthology.org/2023.emnlp-main.629.pdf
- https://viso.ai/deep-learning/vision-language-models/
- https://github.com/friedrichor/Awesome-Multimodal-Papers
- https://openreview.net/forum?id=t877958UGZ&noteId=GeIgCcEPuG
- https://huggingface.co/papers/2412.04467