Large Language Models for Symbolic World Model Generation: A New Benchmark

The rapid development of large language models (LLMs) opens up exciting possibilities in a wide variety of application areas. One particularly promising field is the automatic generation of symbolic world models from text descriptions. These models allow complex scenarios and processes to be represented in a structured and machine-readable form, which in turn is of great importance for applications in areas such as planning, simulation, and robotics.

Previous research on generating world models using LLMs has encountered various challenges. The evaluation of the generated models was often subject to randomness and based on indirect metrics. In addition, the scope of application was mostly limited to a few specific domains. To address these limitations, a new benchmark called Text2World was recently introduced.

Text2World: A Comprehensive Benchmark for Evaluating LLMs

Text2World is based on the Planning Domain Definition Language (PDDL), an established language for describing planning problems. The benchmark comprises hundreds of different domains and uses multi-criteria, execution-based metrics for a more robust evaluation of the generated world models. In contrast to previous approaches, which often relied on syntactic comparisons, Text2World allows a direct verification of the functionality and consistency of the models.
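To make the PDDL format more concrete, the following minimal sketch holds a toy domain as a Python string and runs two purely structural checks on it. The domain `switch-world`, its predicates and actions, and the helper names `parens_balanced` and `action_names` are all hypothetical and chosen for illustration; none of them comes from the benchmark. Checks of this kind only look at the surface of the text, which is the kind of evaluation that execution-based metrics are meant to go beyond.

```python
import re

# Hypothetical toy domain written for illustration; not taken from Text2World.
DOMAIN = """
(define (domain switch-world)
  (:requirements :strips)
  (:predicates (light-on) (light-off))
  (:action turn-on
    :parameters ()
    :precondition (light-off)
    :effect (and (light-on) (not (light-off))))
  (:action turn-off
    :parameters ()
    :precondition (light-on)
    :effect (and (light-off) (not (light-on)))))
"""


def parens_balanced(text: str) -> bool:
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its '('
                return False
    return depth == 0


def action_names(text: str) -> list[str]:
    """Collect the names that follow each '(:action' keyword."""
    return re.findall(r"\(:action\s+([\w-]+)", text)


if __name__ == "__main__":
    print(parens_balanced(DOMAIN))
    print(action_names(DOMAIN))
```

A domain can pass both checks and still describe the wrong behavior, for example if an effect is missing.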

The execution-based evaluation of Text2World offers decisive advantages. By simulating actions within the generated world models, their properties and behavior can be directly tested. This allows a more precise assessment of the quality and reliability of the models compared to purely syntactic methods.
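The idea of testing a model by simulating its actions can be sketched in a few lines. The example below is a simplified STRIPS-style illustration under stated assumptions, not the benchmark's actual metric code: the `Action` class, the `apply` and `run_plan` helpers, and both toy models are hypothetical. It executes the same plan in a reference model and in a generated model and compares the resulting states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A ground STRIPS-style action (hypothetical, simplified representation)."""
    name: str
    pre: frozenset      # facts that must hold before the action
    add: frozenset      # facts made true by the action
    delete: frozenset   # facts made false by the action


def apply(state: frozenset, action: Action):
    """Return the successor state, or None if the precondition fails."""
    if not action.pre <= state:
        return None
    return (state - action.delete) | action.add


def run_plan(state: frozenset, plan: list, actions: dict):
    """Execute action names in order; None means the plan is not executable."""
    for name in plan:
        state = apply(state, actions[name])
        if state is None:
            return None
    return state


# Reference model: turning the light on also removes 'light-off'.
reference = {
    "turn-on": Action("turn-on", frozenset({"light-off"}),
                      frozenset({"light-on"}), frozenset({"light-off"})),
}
# Generated model with a deliberate bug: the delete effect is missing.
generated = {
    "turn-on": Action("turn-on", frozenset({"light-off"}),
                      frozenset({"light-on"}), frozenset()),
}

start = frozenset({"light-off"})
plan = ["turn-on"]

if __name__ == "__main__":
    same = run_plan(start, plan, reference) == run_plan(start, plan, generated)
    print("models agree on this plan:", same)
```

Both toy models have the same action name and precondition, so a surface comparison could miss the difference; executing the plan exposes it because the generated model leaves the light both on and off.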

Benchmark Results and Future Research

Initial benchmark tests with current LLMs show that reasoning models trained with large-scale reinforcement learning perform best. Nevertheless, even the best-performing model still shows limited capabilities in world modeling, particularly for complex domains.

To further improve the capabilities of LLMs in this area, various strategies are being investigated. These include test-time scaling, where additional computation is spent during inference (for example, by generating and refining several candidate models) while the LLM itself remains unchanged, as well as agent training.
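One common form of test-time scaling is best-of-N sampling: several candidates are generated and the one that scores highest under some check is kept. The sketch below illustrates only this general pattern and does not reproduce the procedure used with Text2World. The function `best_of_n`, the canned candidates that stand in for LLM outputs, and the toy scoring rule are all assumptions made for the example.

```python
from typing import Callable


def best_of_n(generate: Callable[[int], str],
              score: Callable[[str], float],
              n: int) -> str:
    """Generate n candidates and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(i) for i in range(n)]
    return max(candidates, key=score)


# Canned strings standing in for LLM samples (hypothetical, for illustration).
CANNED = [
    "(define (domain d)",                  # unbalanced parentheses
    "(define (domain d))",                 # balanced, but no actions
    "(define (domain d) (:action a))",     # balanced, one action
]


def fake_generate(i: int) -> str:
    """Stand-in for a model call; returns the i-th canned candidate."""
    return CANNED[i % len(CANNED)]


def toy_score(text: str) -> float:
    """Toy score: 1 point for matching paren counts plus 1 per action."""
    balanced = text.count("(") == text.count(")")
    return (1.0 if balanced else 0.0) + text.count("(:action")


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, best_of_n(fake_generate, toy_score, n))
```

With a larger budget `n`, the selected candidate can only stay the same or improve under the chosen score, which is the basic trade of extra inference compute for quality. In practice the score would come from a real validator or from execution-based checks rather than from this toy rule.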

Text2World provides a valuable basis for future research in the field of world model generation. The benchmark enables a systematic and comparable evaluation of different LLMs and contributes to the development of more powerful models.

For companies like Mindverse, which specialize in the development of AI-based solutions, these advances are of particular interest. The ability to automatically generate symbolic world models from text descriptions opens up new perspectives for the development of innovative applications in areas such as chatbots, voicebots, AI search engines, and knowledge management systems.

Potential for Companies like Mindverse

The developments in the field of world model generation using LLMs offer companies like Mindverse a variety of application possibilities. By integrating Text2World into the development process, the capabilities of LLMs to generate symbolic world models can be systematically evaluated and optimized. This enables the development of customized AI solutions that can understand and process complex scenarios.

Bibliography:

https://huggingface.co/papers/2502.13092
https://arxiv.org/html/2502.13092v1
https://paperreading.club/page?id=285304
https://huggingface.co/papers
https://github.com/TianXingchen
https://tianxingchen.github.io/
https://arxiv.org/abs/2502.04728
https://www.chatpaper.com/chatpaper/es?id=3&date=1739894400&page=1
https://www.researchgate.net/publication/388847769_Generating_Symbolic_World_Models_via_Test-time_Scaling_of_Large_Language_Models
https://docs.api.nvidia.com/nim/reference/google-gemma-2-9b-it
