The Fiefdom of Files

Baba is Eval

3rd of July, 2025

Baba is You is a Sokoban-style puzzle game where the rules themselves have to be manipulated to win. (For the uninitiated, the store page explains the idea best.) The level of abstraction required to solve most levels makes it a formidable reasoning benchmark, with many reasoning steps orthogonal to other tasks out there. The game is turn-based, so the number of turns required to solve a level naturally serves as a more fine-grained metric beyond accuracy.

This makes Baba is You quite similar to the proposed ARC-AGI-3 benchmark, scheduled for release in 2026. Except it already exists! That is, however, also the main problem with using it as a serious benchmark: the solutions for the main game are out there in both text and image form. Luckily, if that ever becomes a problem, there is also a wealth of clever, high-quality levels created by players, even entire level packs with new mechanics, and those mostly don't have solutions published online.

Inspired by Claude plays Pokémon and the Factorio Learning Environment, in this devlog we'll turn Baba is You into a demo version of Baba is Eval.

Desiderata

Whether in Factorio or ARC-AGI, current multimodal models still tend to do best with a text representation of the 2D world; screenshots are often less helpful. We therefore need to implement (1) fetching the game state into the language model context. Then, (2) the model should be able to control the level, which involves only the principal actions left, right, up, down, undo and reset, ideally faster than a human could input them. We'll also want (3) menu navigation to completely automate the state management.

Fetching Game State

Humans interact with the game visually, so the first thought might be to read it in via a vision model. In the case of Baba is You, though, it pays off to look at the game files exposed to us. Opening the game directory, we see the binary itself is only 8 MB. Quite a bit of game logic is implemented in plaintext Lua scripts extending the base engine, Multimedia Fusion 2. The game even defines hooks to be used by mods, which fire on events like "level_start", which is perfect for us.

Out of all exposed functions (documentation), we find two that allow I/O: MF_read(storage(str), group(str), item(str)) -> result(str) and MF_store(storage(str), group(str), item(str), value(str)). These write to one of four predefined storage files (such as "level") in an INI format, with sections delineated by [group] followed by key=value pairs, one per line.
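On the Python side, this format falls out of the standard library for free. A minimal sketch, assuming hypothetical group and item names (what the mod actually stores per section is up to us):

```python
import configparser

# Hypothetical contents of one of the four storage files, as they might
# look after a few MF_store("level", ...) calls; the section and key
# names here are made up for illustration.
raw = """\
[unit1]
unitname=baba
xpos=4
ypos=2

[unit2]
unitname=wall
xpos=9
ypos=1
"""

parser = configparser.ConfigParser()
parser.read_string(raw)  # parser.read(path) would take the real file instead

units = {section: dict(parser[section]) for section in parser.sections()}
print(units["unit1"]["unitname"])  # baba
```

Note that configparser returns all values as strings, so coordinates need an explicit int() conversion downstream.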

To actually get the current game state, there is luckily the function MF_getunits() -> {fixed(int)}. It returns a table of objects that, as inferred from other uses in the source code, can be deserialized with mmf.newObject, yielding a Lua object table containing all the entities in a level. While the entities' fields aren't documented, other instances in the code tell us each unit has the properties UNITNAME, XPOS, YPOS and ZPOS. We can now construct a table of all the elements in a level and put that in the context. We also need a way to signal when the game has been won, which can be recorded in the level_won mod hook.

We set up a Python MCP server with a tool that displays this information. On every state change, we serialize the table from Lua, then read it in on demand with configparser from Python. Because language models aren't the best at spatial reasoning over raw coordinates, we print a grid of all the entities instead. We also need the bounds of the level, which are conveniently already loaded in a global variable on the Lua side (roomsizex). When multiple entities occupy the same X,Y-position, we print them in the same cell ordered by their Z value ("z99>z1"). We ignore the direction an object faces, although it matters in some levels.
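The grid rendering itself can be sketched in a few lines. This is a simplified stand-in, assuming the entity table has already been parsed into (name, x, y, z) tuples; the unit subset and the "tile" background object are made up for the example:

```python
from collections import defaultdict

# Entities as (name, x, y, z), as they might come out of the parsed
# storage file; this subset and the overlapping "tile" are invented.
units = [
    ("baba", 4, 2, 1),
    ("tile", 4, 2, 0),
    ("text_baba", 3, 3, 1),
    ("text_is", 4, 3, 1),
    ("text_you", 5, 3, 1),
]
roomsizex, roomsizey = 6, 4  # level bounds from the Lua globals

cells = defaultdict(list)
for name, x, y, z in units:
    cells[(x, y)].append((z, name))

def render(cells, width, height):
    rows = []
    for y in range(1, height + 1):
        row = []
        for x in range(1, width + 1):
            # stack overlapping entities top-down by Z value ("z99>z1")
            stack = sorted(cells.get((x, y), []), reverse=True)
            row.append(">".join(name for _, name in stack))
        rows.append(row)
    # pad each column to its widest entry so the grid lines up
    widths = [max(len(row[i]) for row in rows) or 1 for i in range(width)]
    return "\n".join(
        "|".join(cell.ljust(w) for cell, w in zip(row, widths)) for row in rows
    )

print(render(cells, roomsizex, roomsizey))
```

The cell at (4, 2) renders as "baba>tile", demonstrating the Z-ordering for stacked entities.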

Let's demonstrate on a small level, "Lake-Extra 1: Submerged Ruins". If you happen to know the game already, you can assume the role of the LLM and come up with a solution yourself. It's quite tricky, but possible, to solve it without trial and error.
Baba is You level screenshot

Calling the MCP tool returns:

y/x|    1    |   2   |    3    |   4   |   5    |  6  |  7  |  8  |  9  |  10 |    11   |    12   |     13    |   14   |  15
---+---------+-------+---------+-------+--------+-----+-----+-----+-----+-----+---------+---------+-----------+--------+-----
1  |         |       |         |       |        |     |     |     |wall |wall |wall     |wall     |wall       |wall    |wall
2  |         |       |         |baba   |        |     |     |     |rock |     |         |text_crab|text_flag  |        |wall
3  |         |       |text_baba|text_is|text_you|     |     |     |wall |     |         |wall     |text_is    |text_is |wall
4  |         |       |         |       |        |     |     |     |wall |     |         |wall     |text_defeat|text_win|wall
5  |         |       |         |       |wall    |wall |wall |wall |wall |     |         |wall     |wall       |wall    |wall
6  |         |       |         |       |wall    |     |     |     |wall |     |         |         |           |        |wall
7  |         |       |         |       |crab    |crab |flag |     |wall |     |text_rock|text_is  |text_push  |        |wall
8  |text_wall|text_is|text_stop|wall   |wall    |     |     |     |wall |     |         |         |           |        |wall

This looks like a surprisingly comfortable format to play the game even as a human, which is a good sign.

Control

We could simulate key presses, but that's a bit boring and slow, especially compared to a direct call from code. In syntax.lua, we find command(key,player_), which gives access to the four movement directions and restart. There is also undo() in undo.lua.

The problem is how to call these asynchronously. Maybe there is some way to define a new hook, but I only found the following ugly method: in the always mod hook, we attempt to open a command Lua file and execute its contents if it exists. The server knows which file the Lua backend is polling for and asynchronously writes to it once the language model has decided on commands. This adds a latency of 50-150 ms per batch of commands read, but at least the commands themselves execute nearly instantly one after another, much faster than keypresses.
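The server side of this file channel can be sketched as follows. The file name, the exact Lua snippet format and the player argument are assumptions about how the mod wires up command(key,player_); only the atomic-rename trick is the point here:

```python
import os
import tempfile

def send_commands(moves, game_dir):
    """Write one batch of moves for the always hook to pick up.
    The Lua side is assumed to execute the file's contents, then delete it."""
    lua = "\n".join(f'command("{move}",1)' for move in moves)
    # Write to a temp file first and rename into place, so the polling
    # hook can never observe a half-written command file.
    fd, tmp = tempfile.mkstemp(dir=game_dir)
    with os.fdopen(fd, "w") as f:
        f.write(lua)
    os.replace(tmp, os.path.join(game_dir, "command.lua"))

demo_dir = tempfile.mkdtemp()
send_commands(["right", "right", "up"], demo_dir)
print(open(os.path.join(demo_dir, "command.lua")).read())
```

os.replace is atomic on the same filesystem, which is what makes the half-written-file race go away.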

Manually solving the level above, we find the solution rrrrrrddddrrrrdldllluuuuurulllllllrrrrrrddddddrrruldlluuuuurullllllrrrrrrddddrdldluuuuurulllllullldlllllddlldluldddrruldlluurddddldrrrrrlluuurrrruurrrrllllllullld, where each letter stands for left, right, up or down, giving 324 bits. Executing this via MCP, the result is:
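As a sanity check on that count: at four possible moves, each letter carries 2 bits, and the string can be decoded and measured in a few lines (the letter-to-direction mapping is the one described above):

```python
import math

solution = (
    "rrrrrrddddrrrrdldllluuuuurulllllllrrrrrrddddddrrruldlluuuuu"
    "rullllllrrrrrrddddrdldluuuuurulllllullldlllllddlldluldddrr"
    "uldlluurddddldrrrrrlluuurrrruurrrrllllllullld"
)
directions = {"l": "left", "r": "right", "u": "up", "d": "down"}

# every letter must be a valid move
assert set(solution) <= set(directions)
bits_per_move = math.ceil(math.log2(len(directions)))  # 2 bits for 4 moves
print(len(solution), len(solution) * bits_per_move)  # 162 324
```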

Did you catch what happened? The screen capture frequency is not fast enough to catch even a single in-between frame, but the game's own graphics are even slower: apparently the true position and the current screen-space position are interpolated to produce the next frame, so for a dozen frames or so we see all entities slide from their initial to their final position in a straight line.
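That sliding effect is consistent with a simple linear interpolation between the old and new positions. A toy sketch (the frame count and coordinates are made up; the game's actual easing function is unknown):

```python
def lerp(a, b, t):
    # linear interpolation: returns a at t=0, b at t=1
    return a + (b - a) * t

start, final = (4.0, 2.0), (10.0, 7.0)
# every intermediate frame lies on the straight segment between the two
# endpoints, which is why the whole solution collapses into one slide
frames = [
    tuple(lerp(s, f, i / 11) for s, f in zip(start, final))
    for i in range(12)  # roughly the dozen frames observed
]
print(frames[0], frames[-1])  # (4.0, 2.0) (10.0, 7.0)
```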

Level Selection

This is a surprisingly hard part of the problem. The Lua-side code for level selection is sparse and complicated, so to save development time (and add a cool visual effect) we enter levels by navigating the map with the same input functionality, then simulate two Enter presses with pyautogui. This further uglifies the solution, especially because simulated inputs are seemingly quite unreliable.

The Language Model Client

We use Claude Desktop as the demo client, again mainly for visual effect. It feels weird to have a consumer-facing app do things like this. We get some basic tool call prompting (and context management?) for free, which is very helpful. We also implement a help function in the MCP server so that the model can get an explanation of the game's rules and keywords, in case it doesn't know them.

Results

Claude 4 is pretty bad at this. It can reliably solve level 0, where the solution is inputting "right" 8 times. Beyond that, though, it struggles with all component subtasks of even the first levels: keeping track of the rules, identifying blocked paths, planning, getting input patterns correct, keeping track of the win condition, identifying a lost game, coming up with rules to try, identifying rules that need to be broken, et cetera. It's François Chollet's insight playing out live. This is why the video of Claude solving level 1 at the top was actually (dramatic musical cue) staged, and only possible via a move-for-move tutorial that Claude nicely rationalized post hoc.

Reasoning models like o3 might be better equipped to come up with a plan, so a natural next step would be to switch to those, away from Claude Desktop. This would also enable more sophisticated context management, which is needed because the game states of larger, more complicated levels would start using too many tokens. A denser representation of the game state, designed for tokenizers instead of humans, e.g. with less whitespace, could also help. Finally, as in the Factorio Learning Environment, maybe the input space can be abstracted with, say, a move_to() tool; full control is only really needed for some levels, like those containing PULL and MOVE.

Baba is You any% is still a while off. If you'd like to keep an eye on the progress bar for it, and maybe try your hand at developing it, you can head over to the repo for this project. I anticipate that many readers will have ideas on how to do this much better than the above.