Curious how they've assessed quality, either qualitatively or quantitatively. How often do the generated documents miss important parts of the codebase or hallucinate requirements? How often do engineers have to redo work because the LLM convincingly told them to build the wrong thing?
You can build real, production-grade systems using LLMs, but these are the hard questions you have to answer.
This is not production ready yet, but based on my preliminary tests, the outputs are about 80% consistent. The plan, of course, is for an architect to review the specs before devs are assigned.
They haven't.
Yes, it's not tested for large volume yet.
Yes. It's amazing we've gotten this far with LLMs, with everyone believing that everyone else has actually validated their claims that _their_ LLM is producing valid output.
Essentially, you've got a bunch of nerds generating code and believing that because it looks right, every other subject matter being output must also be correct.
My target was to reduce the manual work of creating documents. It's definitely a draft that needs to be reviewed by an architect and a QA lead before being passed on. The generated tasks contain the actual actionable work, which can be used as prompts in Cursor or VS Code.
Does anyone write anymore?
It’s difficult to read posts that rely so heavily on AI generated prose.
Everything’s a numbered/bulleted list and the same old turns of speech describe any scenario.
That aside, what's really keeping this from being convincing is the lack of results. How well does this approach work? Who knows. If the data is sensitive, seeing it work on an open source repo would still be illuminating.
Also, we hear a lot elsewhere about the limitations of relying on embeddings for coding tools; it would be interesting to know how those limitations are overcome here.
Interesting point on embeddings, I'll research that more. But as of now, to my knowledge, that's the best available way of identifying close matches. I'll try to find out if there are any alternatives.
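To make "close matches" concrete, this is roughly how embedding-based matching works; a minimal sketch, assuming sentence-transformers is available, with an illustrative model name and a made-up snippet corpus rather than anything from the actual pipeline:

    # Minimal sketch of embedding-based "close match" lookup.
    # The model name and snippet corpus are illustrative assumptions.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Hypothetical corpus of code/doc snippets to match against.
    snippets = [
        "def create_user(payload): ...",
        "class InvoiceService: handles billing and refunds",
        "README section on deployment with Docker",
    ]
    corpus_emb = model.encode(snippets, normalize_embeddings=True)

    def closest_matches(query, top_k=2):
        # Cosine similarity reduces to a dot product on normalized vectors.
        q = model.encode([query], normalize_embeddings=True)[0]
        scores = corpus_emb @ q
        best = np.argsort(scores)[::-1][:top_k]
        return [(snippets[i], float(scores[i])) for i in best]

    print(closest_matches("where is billing handled?"))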
Antony, you'd be right to call me out for not providing a source. So in case it's helpful, this is the last place I recall the subject being discussed:
RAG is Dead, Context Engineering is King
https://www.latent.space/p/chroma
I will check it out and make the necessary updates. Thank you for sharing that.
One easy way to judge the quality of the spec the AI generates is to run it a few times on the same story and compare the differences.
Curious if you tried that - how much variation does the AI produce, or does the grounding in the codebase and prompts keep it focused and real?
I haven't done intensive tests yet, but based on my preliminary tests, the output is about 80% consistent. The remaining differences are mostly the model suggesting additional changes.
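For what it's worth, the check suggested above is easy to script; generate_spec below is a hypothetical stand-in for whatever call actually produces the spec, not part of the real tool:

    # Rough sketch of the "run it a few times and compare" consistency check.
    import difflib
    import itertools

    def generate_spec(story):
        # Hypothetical: call the real spec generator here.
        raise NotImplementedError

    def consistency(story, runs=5):
        # Average pairwise text similarity across repeated runs.
        outputs = [generate_spec(story) for _ in range(runs)]
        ratios = [
            difflib.SequenceMatcher(None, a, b).ratio()
            for a, b in itertools.combinations(outputs, 2)
        ]
        return sum(ratios) / len(ratios)

    # A score around 0.8 would line up with the "about 80% consistent"
    # observation above; anything much lower suggests the grounding
    # isn't keeping the output stable.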
Hello HN, sorry for coming in late; it was past midnight for me when the post was put up by the mods. I'll try to answer all the questions now, thanks for being patient.
"outputs a full requirements document, a technical specification, a test plan, and a complete set of ready-to-work tasks"
No talking to those pesky people needed! I'm certain that an LLM would spit out a perfectly average spec acceptable to the average user.
I assume you are me.