What you’re describing works surprisingly well for toy projects because the blast radius is small and the feedback loop is tight. E2E tests are basically acting as the only thing tethering the agent to reality.
Where we hit limits while building GTWY was when these “Ralph loops” ran long enough that intent drifted. The agent wasn’t wrong in any single step, but after hours of continuous execution it started optimizing for local goals that no longer matched the original problem.
What helped wasn’t smarter prompts, but breaking the loop into explicit, short-lived steps with forced checkpoints. Each step had a clear goal, clear inputs, and a clear end, and then the agent stopped. The next step only started once context was rebuilt deliberately.
Continuous agents feel magical early on, but they tend to accumulate hidden assumptions over time. In practice, bounded execution plus strong tests scaled more predictably than keeping the agent alive indefinitely.
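To make that concrete, here is a minimal sketch of the bounded-step shape described above. It is an illustration, not GTWY's actual code: the `./run-agent-step.sh` command, the `Step` fields, and the gate convention are all assumptions standing in for whatever agent invocation and test suite you use.

```python
import subprocess
from dataclasses import dataclass

# Hypothetical command for one short-lived agent invocation; swap in whatever
# CLI or SDK call you actually use.
AGENT_CMD = ["./run-agent-step.sh"]

@dataclass
class Step:
    goal: str          # what this bounded step is supposed to achieve
    inputs: list[str]  # files/artifacts the step is allowed to read
    gate: list[str]    # command whose exit code decides success (e.g. the E2E suite)

def rebuild_context(step: Step, last_summary: str) -> str:
    # Context is reconstructed deliberately from the checkpoint, not carried
    # forward from the agent's previous conversation state.
    return (
        f"Goal: {step.goal}\n"
        f"Allowed inputs: {', '.join(step.inputs)}\n"
        f"Previous checkpoint: {last_summary}\n"
    )

def run_plan(steps: list[Step]) -> None:
    last_summary = "fresh start"
    for step in steps:
        prompt = rebuild_context(step, last_summary)
        # One short-lived agent run: it gets the prompt on stdin, does its work,
        # and then exits.
        run = subprocess.run(AGENT_CMD, input=prompt, text=True,
                             capture_output=True, check=True)
        # Forced checkpoint: the step only counts if its gate passes.
        gate = subprocess.run(step.gate)
        if gate.returncode != 0:
            raise RuntimeError(f"Step failed its gate: {step.goal}")
        # Only a short summary survives into the next step's context.
        last_summary = run.stdout.strip()[:2000]
```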
> intent drifted
Indeed, I believe this is probably the next thing to solve, but even here I don't think it is out of reach. What we ought to be able to do is decouple the goals of the project from where we currently are and let them evolve asynchronously. In normal software building, this is encapsulated by the roadmap. I am building roadmapping prompts now and broadening the scope of the software development lifecycle even further to encapsulate the roadmap as well, which was previously out of scope for the experiment I am running now.
The prompts I am using now give the agent autonomy over 'make the next PRD that makes sense'. I think it is a straightforward extension to add 'in the context of the @roadmap/' or similar, with probably decent results.
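A minimal sketch of what that extension could look like, assuming the roadmap lives as markdown files under a `roadmap/` directory and the agent takes a plain-text prompt; the file layout and wording here are assumptions, not what the kata repo actually does:

```python
from pathlib import Path

# Assumption: the roadmap is a directory of markdown files.
ROADMAP_DIR = Path("roadmap")

def build_prd_prompt() -> str:
    roadmap_docs = sorted(ROADMAP_DIR.glob("*.md"))
    roadmap_text = "\n\n".join(p.read_text() for p in roadmap_docs)
    return (
        "Make the next PRD that makes sense, "
        "in the context of the roadmap below. "
        "Prefer items that unblock later roadmap entries.\n\n"
        f"--- roadmap ---\n{roadmap_text}\n"
    )

if __name__ == "__main__":
    # The composed prompt would then be fed to whatever loop drives the agent.
    print(build_prd_prompt())
```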
Have you tried something similar?
Even without a roadmap, the agent continues to do useful work 24 hours in. You can see the commits and PRDs; they really are quite sensible. I pulled and tested, and everything really is working quite well. Frankly, I am shocked it is working at all. I have had to step in once or twice, so you definitely need to keep an eye on the logs every once in a while. Getting the loop booted up in a reliable way was the hardest part, to be honest, and even that was not terribly difficult.
https://github.com/waynenilsen/ralph-kata-2
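For reference, here is one way that outer loop can be kept alive, sketched in Python with a placeholder `./ralph-step.sh` agent command, an assumed log path, and an assumed restart policy; it illustrates the shape of the supervisor, not the exact script in the repo.

```python
import subprocess
import time
from datetime import datetime, timezone

# Placeholder for the actual per-iteration agent invocation.
AGENT_CMD = ["./ralph-step.sh"]
LOG_PATH = "ralph.log"      # assumed log location
BACKOFF_SECONDS = 30        # pause before restarting a failed iteration

def main() -> None:
    iteration = 0
    with open(LOG_PATH, "a") as log:
        while True:
            iteration += 1
            stamp = datetime.now(timezone.utc).isoformat()
            log.write(f"[{stamp}] starting iteration {iteration}\n")
            log.flush()
            try:
                # Each iteration is a fresh process, so a wedged run cannot
                # poison the next one.
                subprocess.run(AGENT_CMD, stdout=log, stderr=log,
                               timeout=60 * 60, check=True)
            except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
                log.write(f"[{stamp}] iteration {iteration} failed: {exc}\n")
                log.flush()
                time.sleep(BACKOFF_SECONDS)

if __name__ == "__main__":
    main()
```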
You'd believe LLMs could reverse engineer software, but this is not the case today.
> it created a full multi-tenant auth system from scratch
OK. And did that scratch auth system pass any level of security testing? If it did, great, that is worth talking about. But what I've seen generated by AI isn't anywhere near secure.
I have seen the same. However, it can often easily find its own bugs when prompted to do so, in this case perhaps with a ticket.
The ticket burndown is a very nice feature, because whenever you want to add a ticket it'll just pick it up and do its best.