29 comments

  • purple-leafy 5 hours ago

    Hey this is really cool! And your game is really inventive; I’d love to try it when I’m home from work.

    Have you considered NOT using an LLM to test your game? Because your game is turn-based and text-based, could you separate rendering and logic entirely (you may have already done this by the sounds of it) and run a headless simulator that simulates thousands of games using a Monte Carlo-type method? Is your game fully deterministic outside of player input?

    Reason I ask is I’m making a game that’s fully deterministic; the only randomness is player input. Same inputs = same outputs from my traditional AI enemies.

    With this in mind, I was able to completely separate rendering and game logic. To tune my enemy AI (traditional AI, not LLM) I can run millions of simulated games headless, generate reports of the games, and automatically toggle AI parameters each game until my AI is “perfect” for its archetype signature.

    I can run tens to hundreds of games in parallel, and I can run a typical 5 minute game in seconds.

    Then I can capture that game and recreate it and watch replays etc.

    My game is also a browser game, but I built my own engine for it from scratch with no external libraries.
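    A minimal sketch of that headless tuning loop in Python (the `simulate` stub and the parameter grid are made up for illustration; a real game would plug in its own deterministic step function):

```python
import itertools
import random

def simulate(aggression, patience, seed):
    """Stand-in for one headless game: deterministic given params + seed."""
    rng = random.Random(seed)
    score = 0
    for _ in range(100):  # 100 turns per simulated game
        if rng.random() < aggression:
            score += 1  # aggressive action paid off
        elif rng.random() < patience:
            score -= 1  # waited too long, lost tempo
    return score

# Sweep a small parameter grid; each cell averages many seeded games,
# which stays cheap because there is no rendering in the loop.
results = {}
for aggression, patience in itertools.product([0.2, 0.5, 0.8], repeat=2):
    scores = [simulate(aggression, patience, seed) for seed in range(200)]
    results[(aggression, patience)] = sum(scores) / len(scores)

best = max(results, key=results.get)
print("best params:", best)
```

    Because every run is seeded, any interesting game can be replayed exactly from its (params, seed) pair, which is what makes the capture-and-replay step work.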

    • jschomay an hour ago

      Thank you. You have a great suggestion. I didn't do that, but I did consider it and I think it can be very powerful. I had two example use cases where having an actual AI felt valuable: first, validating a new feature based on the spec, and second, finding unexpected bugs (like trying to enter a locked room through the back wall). It didn't do so well on the latter, but did great on the former. Having a million simulated games could probably catch those, but how would you track the reports after? Perhaps using an LLM to read the logs/reports could be a good use. Your setup sounds awesome, nice work.

      • purple-leafy 28 minutes ago

        You’re welcome :) For you I’d recommend trying to get 10 games running/simulated first, and manually analysing the reports yourself to see if the report data is useful. Try to get the report data into a useful shape, as either a JSON array or a spreadsheet. Then you can feed it into an LLM to analyse.

        For example, my reports will basically be data points per AI archetype: how often they collide with a wall, how often they perform certain actions, how often they get blocked or go idle. Straight numbers or booleans. On top of that, an Elo-type system rates the AIs against one another so I can have an AI tier list. Then I can get an LLM to ingest the data and pick out issues/outliers.

        My game is kinda like chess so this all makes sense for my game.
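        The Elo part is just the standard pairwise update; a minimal version (the K-factor, archetype names, and match results are placeholders):

```python
def elo_update(r_a, r_b, score_a, k=32):
    """Standard Elo: score_a is 1.0 for an A win, 0.5 draw, 0.0 loss."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
    return r_a_new, r_b_new

ratings = {"rusher": 1200.0, "turtle": 1200.0, "balanced": 1200.0}
# Hypothetical head-to-head results from simulated games: (winner, loser).
matches = [("rusher", "turtle"), ("balanced", "turtle"), ("rusher", "balanced")]
for winner, loser in matches:
    ratings[winner], ratings[loser] = elo_update(ratings[winner], ratings[loser], 1.0)

tier_list = sorted(ratings, key=ratings.get, reverse=True)
print(tier_list)
```

        Ratings are zero-sum per match, so the pool average stays fixed and the sorted ratings read directly as a tier list.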

        And thanks for the insights. I will try a similar LLM setup for manually playing my game; it’s definitely possible, and your blog is inspiring.

    • deadbabe 20 minutes ago

      This is the way to have a very tightly balanced game. I’ve seen people come up with a lot of sophisticated graphs and curves of various params and inputs that I personally don’t understand, but they tune things to values that naturally result in the kind of outcomes players will enjoy best. It would be impossible to tweak all these variables and their interactions through manual playtests alone.

  • justindz 10 hours ago

    What a great lunch read! I've been weekend-warrioring a terminal-based CRPG for a bit myself. I was recently exploring ways to use agents to help with balance testing, which is a real scale problem for solo indie dev. So far, all I've created is a fight simulator: essentially, have the current player state (stats, effects, gear, companions, etc.) do this fight, simulated, X number of times using one of the currently-implemented GOAP personalities and report how often it wins, loses, average end turn, stuff like that.
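    That kind of batch fight report can be sketched in a few lines (the combat resolution here is a toy stub standing in for the real GOAP-driven fight):

```python
import random
from statistics import mean

def run_fight(player_power, enemy_power, rng):
    """Toy combat: trade blows until one side drops."""
    player_hp, enemy_hp, turn = 100, 100, 0
    while player_hp > 0 and enemy_hp > 0:
        turn += 1
        enemy_hp -= rng.randint(1, player_power)  # player strikes first
        if enemy_hp > 0:
            player_hp -= rng.randint(1, enemy_power)
    return {"won": enemy_hp <= 0, "end_turn": turn}

def report(n_fights, player_power=12, enemy_power=10, seed=0):
    """Run the same matchup n times and aggregate win rate / fight length."""
    rng = random.Random(seed)
    fights = [run_fight(player_power, enemy_power, rng) for _ in range(n_fights)]
    return {
        "win_rate": sum(f["won"] for f in fights) / n_fights,
        "avg_end_turn": mean(f["end_turn"] for f in fights),
    }

print(report(1000))
```

    Swapping in different GOAP personalities would just mean parameterizing `run_fight` by decision policy instead of raw power numbers.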

    I hadn't really thought about trying to create a harness for agents to play the full game interactively. I'd love to explore this. If you don't mind, here are a few questions:

    1) Correct to assume that I probably need a text-only harness even though my game is already text-based, because I make use of menu selections made via arrow-key-and-enter interactions?

    2) Do you have prompt recommendations for the type of feedback you have found to be useful? I would guess in your case, the objectives of the game are more clear than an open-world RPG. What dead ends have you run into? Maybe a variety of approaches would be good? One agent tries to fight everything. Another focuses on gaining and completing as many quests as possible?

    3) How bad is the token burn doing this? Any optimization strategies you've employed?

    • jschomay an hour ago

      OP here: Thank you and I appreciate the thoughtful questions. To answer: 1) I used a text representation because it made sense for my game and let me "render" certain details in a more AI-friendly way, like the compact map. You could use something like agent-browser and it would probably work just fine, but I figured it added an extra layer of indirection that I didn't need, plus it would be a lot of screenshots! Being able to have a turn based loop really helped make this work.

      2) I had a skill on just how to use the playtest server. I also gave it context on what the game is and how to play it. From there, it probably depends on your use case. I wasn't that impressed with its natural ability to playtest for bug discovery, so I would consider making a skill describing what a playtester would normally do. Focused playtester instances are a good idea. Ultimately, what I found to be most helpful was to point it at a feature or bug that I was aware of and have it validate it. Not only was it fairly successful, it was also the part that saved the most time for me.

      3) I think I only burned about 300K tokens on my longest play-test session, and that includes a bunch of code tweaks too. Running it after every feature as a validation step is pretty cheap. Running it overnight in "open" playtesting could add up.

      Good luck, please let me know how it goes if you get somewhere helpful!

    • lubujackson 8 hours ago

      I did something similar, but instead of having the LLM play the game I had it build an entire bot system to play the game. Bots require much more determinism, but I'd rather burn tokens encoding problem solving approaches and bot decision profiles than using LLMs for every turn of the game. This can be developed rapidly if you create an agent in a loop and say "figure out how to have the bot reach room 3 in under 10 actions" or something like that. It is easy for this to get bloated, but I found it makes a nice feedback loop that allows me to quickly test things like pacing changes and think of the game as a series of user actions that can be sculpted purposefully.

      • justindz 8 hours ago

        Thanks, this is another great idea and I'll consider it as an addition or alternative. Do you think this works in an open-world, non-linear type game?

  • StephenAshmore 10 hours ago

    I've been doing something similar on my own weekend game! I've got two games in rust I'm working on, a simple one in tauri and a more traditional 2D game. For both, I added a CLI that allows me or AI to play the game and test. It hooks into the actual game state just like here as another way to "render" the game. I think this is pretty similar to end-to-end testing strategies, but with the current state of AI you can have really interesting testing while you're building something. I appreciate starting a fresh AI with no context on the game and giving it just instructions on how to use the CLI. It's an extra pair of eyes for rubber-ducking.

  • squeegmeister 11 hours ago

    I recently added E2E tests in my game too. One of the benefits is that I can have my agent verify its own work by asking it to write a test and look at screenshots. That means I can say “I’m going to bed, implement this and verify it with e2e tests” and it gets further along than it used to.

  • Jabrov 10 hours ago

    I can’t wait until the distant future where strategy games will have actually good and interesting AI that can communicate and reason

  • fishtoaster 7 hours ago

    I landed on something similar for my own game, though it's been pretty tricky.

    I'm building a physics-based 2d game involving slingshotting around planets. The realtime nature of it has meant that it's nearly impossible for the AI to test using a browser mcp. It'll take one screenshot, then another, and in the intervening time the player shot off the map and into deep space.

    Instead I gave it both a code-level api to step forward and backward the physics engine and a browser-based, `window.game` api to do it via a browser mcp console. The former helps it work out physics bugs and the latter helps it test animation and UI issues.
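    A step/rewind harness like that can be tiny if the engine is fixed-timestep: record a snapshot per tick and pop to rewind (the "physics" here is a made-up toy; a real engine would snapshot its full world state):

```python
import copy

class SteppableWorld:
    """Fixed-timestep world an agent can step forward and rewind deterministically."""
    def __init__(self, pos=0.0, vel=1.0, dt=0.1):
        self.state = {"pos": pos, "vel": vel}
        self.dt = dt
        self.history = [copy.deepcopy(self.state)]  # one snapshot per tick

    def step(self):
        self.state["pos"] += self.state["vel"] * self.dt
        self.state["vel"] -= 0.5 * self.dt  # toy gravity-like pull
        self.history.append(copy.deepcopy(self.state))
        return self.state

    def step_back(self):
        if len(self.history) > 1:
            self.history.pop()
        self.state = copy.deepcopy(self.history[-1])
        return self.state

world = SteppableWorld()
for _ in range(5):
    world.step()
pos_after_5 = world.state["pos"]
world.step_back()
world.step()
# Rewinding one tick and re-stepping lands on exactly the same state.
assert abs(world.state["pos"] - pos_after_5) < 1e-9
```

    This is also what makes the two screenshots the agent takes comparable: nothing moves between its observations unless it asks for a step.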

    It's still not great. I keep occasionally getting "I tested it and it works perfectly!" as I stare at the mcp'd browser with the player stuck clipped halfway into a planet. I think, if anything, I need to lean harder into this approach: building really solid tooling for the AI to inspect every aspect of state. I would kill for a turn-based game like OP's XD

    • jschomay an hour ago

      OP here, cool to see all the similar yet varied testing approaches! Your situation sounds tricky with a real-time physics-based game. Converting it to step-based sounds promising, but as you mentioned, every dilution of the full e2e harness also dilutes the validation veracity. When you were describing your game I kept thinking of Bret Victor's "Inventing on Principle" talk, where he "collapses time" in a physics game to render trajectories of objects in all positions at once, to tell visually and intuitively whether it works right. Perhaps that could apply?

  • chrisweekly 12 hours ago

    This is awesome. Thanks for sharing! The text-based renderer reminds me of playing Larn on my dad's VT100 when I was a child (early 80s).

  • jongalloway2 9 hours ago

    I've been doing this lately, building a Godot game with Copilot CLI. I'm using Godot MCP Pro which can automate interactions and screenshots, and have the whole game script in a markdown doc. I was happily surprised when I asked for a walkthrough and it all just worked, found and fixed some regressions while I was sleeping.

  • shnippi 10 hours ago

    This is sick, thanks for sharing! We've been working on very similar things for the past 2 years. We also started with a text-only representation, but sadly quickly realized that only a small subset of games work well with this.

    So we went down a rabbit hole and decided to do everything purely based on pixels and OS inputs.

    We're currently only live for mobile but happy to give you early access to nunu ai for PC if interested. Would love to see how we compare!

    • jschomay an hour ago

      Hi, thanks! That sounds really interesting. I'm curious how it compares too. What is the best way to get in touch?

  • ZeidJ 6 hours ago

    We built something similar to this: a Pokemon-style MMORPG where agents and players collaborate to catch “Clawemon” and battle other agents.

    We posted it online and surprisingly got a lot of negative feedback from users mentioning they would never spend valuable tokens on playing a game.

    Our intention was to create an interaction experiment to see how agents interact with each other and with their human companions. We ended up making a pretty fun game in the process, which we're still working on.

    Bring-your-own-inference as a potential future of gaming does not seem too far off.

    For anyone interested here is the HN post: https://news.ycombinator.com/item?id=47849872

  • moconnor 4 hours ago

    This is the future of all software; the benefits of making it accessible to agents are overwhelming.

  • zoetaka38 11 hours ago

    Built something similar for E2E web testing recently. A few observations from running an agentic test harness in production:

    1. The single biggest jump in test quality came from giving the agent BOTH source code analysis AND live browser snapshots, not either alone. With code-only the agent hallucinates selectors; with browser-only it misses project conventions. Two MCP servers feeding the same agent — one local file-read, one Playwright in-process — was the architecture that worked.

    2. For the browser snapshot tool, returning the raw DOM ate tens of thousands of tokens per call and the agent struggled to navigate it. Swapping to accessibility-tree refs (e1, e2, ...) cut token usage by ~10x and made the agent reliably target the right elements.

    3. We avoided Docker-based MCP servers in production (we run on ECS Fargate). The in-process SDK MCP pattern (create_sdk_mcp_server + @tool decorator) keeps the browser handle in scope of the tool definition, which let us attach page.on('console') listeners and have the agent read them via a separate tool. Hard to do that across stdio process boundaries.
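    The accessibility-ref trick from point 2 can be sketched without Playwright: flatten an element tree into short `e1`/`e2` ref lines the agent can target (the tree shape here is invented; Playwright's own snapshot format differs):

```python
def snapshot(node, refs=None, lines=None, depth=0):
    """Assign e1, e2, ... refs to nodes and emit a compact, indented outline."""
    if refs is None:
        refs, lines = {}, []
    ref = f"e{len(refs) + 1}"
    refs[ref] = node
    lines.append("  " * depth + f"{ref} {node['role']} {node.get('label', '')!r}")
    for child in node.get("children", []):
        snapshot(child, refs, lines, depth + 1)
    return refs, "\n".join(lines)

# Toy accessibility tree standing in for a real page snapshot.
tree = {
    "role": "page", "label": "Checkout",
    "children": [
        {"role": "textbox", "label": "Email"},
        {"role": "button", "label": "Pay now"},
    ],
}
refs, text = snapshot(tree)
print(text)  # the agent sees this compact outline instead of raw DOM
```

    The agent then says "click e3" and the harness resolves `refs["e3"]` back to the live element, so there is no selector for it to hallucinate.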

    For game testing specifically — your text-renderer detail is interesting because it sidesteps the visual-grounding problem (how does the agent verify what it's seeing?). Curious how you'd extend this to a 2D/3D rendered game where the screen state isn't easily textualized.

  • haunter 8 hours ago

    Is there an AI which can "solve" the Path of Exile 1/2 passive skill tree yet?

    • nickstinemates 8 hours ago

      Solve for what?

      The degree of choice point-to-point in the skill tree is actually quite limited in most circumstances. There are obviously items, like Thread of Hope or Intuitive Leap, or inversion-of-choice items like Unnatural Instinct, which change it slightly.

      If the question is path optimization to make use of these nodes, Path of Building already does a good job. If the question is "what single node will give me the most theoretical power?", it also solves that.

      That's actually the beauty of Path of Exile as a whole - the different systems works in combination to lead to an outcome. As an example, If you're a life stacking build, finding unique ways to get as many life/strength nodes as possible. That's your gear and your passive tree working in tandem.

      Speaking about using AI to optimize characters - not just the skill tree - you'd need to build some pretty sophisticated tools which do not yet exist to make that happen. No AI alone would be able to do it.

  • Modified3019 9 hours ago

    My earliest desire for real AI was so it could control my dumb fucking harvester in C&C95.

    • Sohcahtoa82 6 hours ago

      I seem to remember the fatal flaw with harvester AI was that once a harvester was returning to the drop-off building, it would "claim" it, and any other harvesters would just do a dance around the building until the first harvester arrived. As a result, a harvester that was further away could block closer trucks if it just happened to fill up sooner.

  • empath75 10 hours ago

    I hooked up an MCP server to a MUD and got some pretty amazing results, including Claude Code agents in separate windows chatting with each other and cooperating on building out a new section.

    • ramses0 9 hours ago

      Do share, pray tell! Which MUD were you using? I've been poking around at MUD/MOO-adjacent capabilities and am having to hold the AI back from authoring its own MUD/MOO capabilities instead of dorking with an existing server (likely one that's full of security holes and complex bespoke startup+install configurations).

      I'd like `mud_or_moo --state-dir ./tmp/some-mud`, which stored most things as plain text, or maybe SQLite if really necessary. The core of a MUD that was conceptually similar to a wiki browser over markdown files (ie: room-001.md => exits => room-002.md) is what I'm angling towards, such that _editing and linking_ felt more comfortable and GUI-like to a human user.

      • empath75 9 hours ago

        I forked Evennia and added it. Took me a few hours with Claude.

        Once I had the core authorship MCPs working, Claude itself created the whole world, including an initial tutorial sequence, combat, etc.

        • ramses0 6 hours ago

          Kind of landed on Evennia as the sweet spot, in reaction to your comment.

          I've walked an agent through Home Assistant => Wiki-per-room => Zork-Me! ...and it turns out that the actual Inform Zork engine is pretty terrible but it's fun to say "go north ; look table" (and eventually "turn on ha.light_001" ;-).

          The "MUD/MOO" aspect is where it opens interesting options of actually curling out to the home assistant instance, and the just kindof wild fun of making a functional "quest" in the context of your own home (eg: solve a mystery? make dinner? battling another user for the TV remote? :-D)

        • ticulatedspline 8 hours ago

          Cool, I was thinking about this very thing. I was looking at CoffeeMud and wondered if, given a starting room and a clean slate, it could basically just build out a whole MUD from scratch.