AI Agent Achieves Expert-Level Pokémon Battle Performance

Artificial intelligence (AI) continues to conquer new domains. A fascinating example is PokéChamp, an AI agent capable of competing in high-level Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp uses the capabilities of large language models (LLMs) to guide minimax tree search.
Minimax search is an algorithm for two-player adversarial games that finds the optimal move. It builds a search tree in which each node represents a possible game state; the algorithm evaluates the states and chooses the move that minimizes the worst-case loss (equivalently, maximizes the guaranteed gain), assuming the opponent plays optimally.
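To make this concrete, here is a minimal, generic minimax sketch in Python. It is not PokéChamp's implementation; `GameState` is a hypothetical interface, and any two-player, zero-sum game exposing these methods would work.

```python
from __future__ import annotations

from typing import Iterable, Optional, Protocol, Tuple


class GameState(Protocol):
    """Hypothetical interface for a two-player, zero-sum game."""

    def is_terminal(self) -> bool: ...
    def evaluate(self) -> float: ...        # heuristic value, maximizer's view
    def legal_moves(self) -> Iterable[object]: ...
    def apply(self, move: object) -> "GameState": ...


def minimax(state: GameState, depth: int, maximizing: bool) -> Tuple[float, Optional[object]]:
    """Return (value, best_move), searching `depth` plies ahead."""
    if depth == 0 or state.is_terminal():
        return state.evaluate(), None

    best_move = None
    if maximizing:
        best_value = float("-inf")
        for move in state.legal_moves():
            value, _ = minimax(state.apply(move), depth - 1, False)
            if value > best_value:
                best_value, best_move = value, move
    else:
        best_value = float("inf")
        for move in state.legal_moves():
            value, _ = minimax(state.apply(move), depth - 1, True)
            if value < best_value:
                best_value, best_move = value, move
    return best_value, best_move
```

The practical obstacle is the branching factor: in Pokémon, the number of reachable states explodes quickly, and the true state is only partially observable. This is exactly where PokéChamp brings LLMs in.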
PokéChamp integrates LLMs in three key areas: player action selection, opponent modeling, and value function estimation. This allows the agent to effectively leverage game history and human knowledge to reduce the search space and handle the partial observability inherent in the game.
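The sketch below shows one simplified way these three hooks could plug into the search. `query_llm` is a placeholder for any chat-completion call, and the prompts and parsing are illustrative assumptions, not PokéChamp's actual prompts.

```python
def query_llm(prompt: str) -> str:
    """Placeholder: call GPT-4o, Llama 3.1, etc. and return the raw text."""
    raise NotImplementedError


def propose_actions(state_description: str, legal_moves: list[str], k: int = 3) -> list[str]:
    """Action selection: ask the LLM for the k most promising moves,
    shrinking the branching factor before minimax expands them."""
    reply = query_llm(
        f"Battle state:\n{state_description}\n"
        f"Legal moves: {legal_moves}\n"
        f"List the {k} strongest moves, comma-separated."
    )
    ranked = [m.strip() for m in reply.split(",")]
    return [m for m in ranked if m in legal_moves][:k] or legal_moves[:k]


def predict_opponent_moves(state_description: str, opponent_moves: list[str], k: int = 2) -> list[str]:
    """Opponent modeling: ask the LLM which replies a human would likely pick."""
    reply = query_llm(
        f"Battle state:\n{state_description}\n"
        f"Opponent's options: {opponent_moves}\n"
        f"List the {k} most likely choices, comma-separated."
    )
    ranked = [m.strip() for m in reply.split(",")]
    return [m for m in ranked if m in opponent_moves][:k] or opponent_moves[:k]


def estimate_value(state_description: str) -> float:
    """Value estimation: ask the LLM to score a non-terminal leaf state in [0, 1]."""
    reply = query_llm(
        f"Battle state:\n{state_description}\n"
        f"On a scale from 0 (certain loss) to 1 (certain win), "
        f"rate the player's winning chances. Answer with a number only."
    )
    try:
        return min(1.0, max(0.0, float(reply.strip())))
    except ValueError:
        return 0.5  # fall back to a neutral value on unparseable output
```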
A remarkable aspect of PokéChamp is that the framework requires no additional training of the underlying LLMs. In practice, existing models such as GPT-4 or Llama can be used directly, without fine-tuning. The developers tested PokéChamp in the popular Gen 9 OU format and achieved impressive results.
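Because no fine-tuning is involved, swapping models amounts to changing a model name. Below is one possible binding for the `query_llm` placeholder above, assuming the `openai` Python package; a locally hosted Llama 3.1 server that exposes an OpenAI-compatible endpoint could be targeted the same way via `base_url`.

```python
# One possible backend for `query_llm`, assuming the `openai` package.
# Swapping GPT-4o for a locally hosted Llama 3.1 only requires changing
# `model` (and `base_url`, if pointing at an OpenAI-compatible server).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def query_llm(prompt: str, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```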
Powered by GPT-4o, PokéChamp achieved a win rate of 76% against the previous best LLM-based bot and 84% against the strongest rule-based bot. Even with the open-source model Llama 3.1 (8 billion parameters), PokéChamp surpassed the previous leader, PokéLLMon (based on GPT-4o), with a win rate of 64%.
On the Pokémon Showdown online ladder, PokéChamp attains a projected Elo rating of 1300-1500, placing it among the top 10% to 30% of human players.
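To put that range in perspective, the Elo model translates rating differences into expected scores. The short calculation below uses the standard Elo formula (Showdown's rating system has its own variants, so this is only a rough guide): a 200-point gap corresponds to roughly a 76% expected score for the stronger player.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of player A against player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# A 1500-rated player is expected to score about 0.76 against a 1300-rated one.
print(round(elo_expected_score(1500, 1300), 2))  # 0.76
```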
Alongside PokéChamp, the developers also compiled the largest dataset of real Pokémon battles to date: over 3 million games, including more than 500,000 matches between high-Elo players. From this dataset they derived benchmarks and puzzles that evaluate specific battle skills, and they made important updates to the local game engine.
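As an illustration of how such a replay corpus might be used, the sketch below filters games by player rating. The JSONL layout and the field names (`ratings`, `log`) are assumptions made for the example; the actual dataset format may differ.

```python
import json
from typing import Iterator


def high_elo_games(path: str, min_rating: int = 1800) -> Iterator[dict]:
    """Yield games (one JSON object per line) where both players meet the threshold."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            game = json.loads(line)  # e.g. {"ratings": [1843, 1912], "log": "..."}
            if min(game["ratings"]) >= min_rating:
                yield game


# Hypothetical usage: count high-Elo games in a local replay dump.
# print(sum(1 for _ in high_elo_games("replays.jsonl")))
```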
The developers hope that this work will stimulate further research that uses Pokémon battles as a benchmark for integrating LLM technologies with game-theoretic algorithms and for solving general multi-agent problems. The combination of AI and game theory opens exciting possibilities for intelligent systems that can operate in complex environments.
The research findings and the associated code are publicly available and offer a valuable resource for the scientific community.
Bibliography:
Karten, S., Nguyen, A. L., & Jin, C. (2025). PokéChamp: an Expert-level Minimax Language Agent. *arXiv preprint arXiv:2503.04094*.
Huang, W., Abbeel, P., Pathak, D., & Mordatch, I. (2022). Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. *Proceedings of the 39th International Conference on Machine Learning (ICML)*.
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. *arXiv preprint arXiv:2210.03629*.
Li, Z., Peng, B., He, P., Galley, M., Gao, J., & Yan, X. (2023). Guiding Large Language Models via Directional Stimulus Prompting. *Advances in Neural Information Processing Systems*, 36.