The demos I see for these types of tools are always some toy project and don't reflect the day-to-day work I do at all. Do you have any example PRs on larger, more complex projects that have been written with Codebuff, and how much of that was human-interactive?
The real problem I want someone to solve is helping me with the real niche/challenging portion of a PR, ex: a new tiptap extension that can do notebook code eval, migrating a legacy auth service off auth0, recording API GET requests and replaying a % of them as unit tests, etc.
So many of these tools get stuck trying to help me "start" rather than help me "finish" or unblock the current problem I'm at.
Absolutely! Imagine setting a bunch of CSS styles through a long-winded AI conversation, when you could have an IDE to do it in a few seconds. I don't need that.
The long tail of niche engineering problems is the time-consuming bit now. That's not being solved at all, IMHO.
> ... setting a bunch of css styles through a long winded AI conversation
Any links on this topic you rate/could share?
+1; Ideally I want a tool I don't have to specify the context for. If I can point it via config files at my medium-sized codebase once (~2000 py files; 300k LOC according to `cloc`) then it starts to get actually usable.
Cursor Composer doesn't handle that and seems geared towards a small handful of handpicked files.
Would codebuff be able to handle a proper sized codebase? Or do the models fundamentally not handle that much context?
Yes. Natively, the models are limited to 200k tokens, on the order of dozens of files, which is way too small.
But Codebuff has a whole preliminary step where it searches your codebase to find relevant files to your query, and only those get added to the coding agent's context.
That's why I think it should work up to medium-large codebases. If the codebase is too large, then our file-finding step will also start to fail.
I would give it a shot on your codebase. I think it should work.
What's the fundamental limitation to context size here? Why can't a model be fine-tuned per codebase, taking the entire code into context (and be continuously trained as it's updated)?
Forgive my naivety, I don't know anything about LLMs.
RAG is a well-known technique now, and to paraphrase Emily Bender[1], here are some reasons why it's not a solution.
The code extruded from the LLM is still synthetic code, and likely to contain errors both in the form of extra tokens motivated by the LLM's pre-training data rather than the input texts, and in the form of omissions. It's difficult to detect when the summary you are relying on is actually missing critical information.
Even if the setup includes links to the retrieved documents, the presence of the generated code discourages users from actually drilling down and reading them.
This is still a framing that says: Your question has an answer, and the computer can give it to you.
[1] https://buttondown.com/maiht3k/archive/information-literacy-...
We actually don't use RAG! It's not that good, as you say.
We build a description of the codebase including the file tree and parsed function names and class names, and then just ask Haiku which files are relevant!
This works much better and doesn't require slowly creating an index. You can just run Codebuff in any directory and it works.
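If you're curious what that step looks like, here's a simplified sketch of the idea (not our actual implementation, and the model alias is an assumption):

```typescript
import Anthropic from "@anthropic-ai/sdk";
import { readdirSync, readFileSync, statSync } from "fs";
import { extname, join } from "path";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Walk the project and collect source file paths, skipping the usual noise.
function listFiles(dir: string, out: string[] = []): string[] {
  for (const name of readdirSync(dir)) {
    if (name === "node_modules" || name === ".git") continue;
    const full = join(dir, name);
    if (statSync(full).isDirectory()) listFiles(full, out);
    else if ([".ts", ".tsx", ".js", ".py"].includes(extname(name))) out.push(full);
  }
  return out;
}

// Crude symbol extraction; a real implementation would use tree-sitter.
function topLevelSymbols(file: string): string[] {
  const src = readFileSync(file, "utf8");
  const matches = src.match(/^(?:export )?(?:async )?(?:function|class|def) (\w+)/gm) ?? [];
  return matches.map((m) => m.split(" ").pop()!);
}

// Ask a small, cheap model which files matter for this request.
async function findRelevantFiles(root: string, request: string): Promise<string[]> {
  const summary = listFiles(root)
    .map((f) => `${f}: ${topLevelSymbols(f).join(", ")}`)
    .join("\n");

  const msg = await client.messages.create({
    model: "claude-3-5-haiku-latest", // model alias is an assumption; adjust as needed
    max_tokens: 512,
    messages: [{
      role: "user",
      content:
        `Codebase summary:\n${summary}\n\nUser request: ${request}\n` +
        `Reply with only a JSON array of the file paths most relevant to the request.`,
    }],
  });

  const first = msg.content[0];
  return JSON.parse(first.type === "text" ? first.text : "[]");
}
```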
It sounds like it's arguably still a form of RAG, just where the retrieval is very different. I'm not saying that to knock your approach, just saying that it sounds like it's still the case where you're retrieving some context and then using that context to augment further generation. (I get that's definitely not what people think of when you say RAG though.)
Genuine question: at what point does the term RAG lose its meaning? Seems like LLMs work best when they have the right context, and that context must be pulled from somewhere for the LLM. But if that's RAG, then what isn't? Do you have a take on this? Been struggling to frame all this in my head, so would love some insight.
I hear you. This is actually a foundational idea for Codebuff. I made it to work within the large-ish codebase of my previous startup, Manifold Markets.
I want the demos to be of real work, but somehow they never seem as cool unless it's a neat front end toy example.
Here is the demo video I sent in my application to YC, which shows it doing real stuff: https://www.loom.com/share/fd4bced4eff94095a09c6a19b7f7f45c?...
It's pretty good for complex projects IMO because Codebuff can understand the structure of your codebase and which files to change to implement changes. It still struggles when there isn't good documentation, but it has helped me finish a number of projects.
> It still struggles when there isn't good documentation
@Codebuff team, does it make sense to provide a documentation.md with exposition on the systems?
One cool thing you can do is ask Codebuff to create these docs. In fact, we recommend it.
Codebuff natively reads any files ending in "knowledge.md", so you can add any extra info you want it to know to these files.
For example, to make sure Codebuff creates new endpoints properly, I wrote a short guide with an example of the three files you need to update, and put it in backend/api/knowledge.md. After that, Codebuff always creates new endpoints correctly!
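For illustration, a knowledge.md along those lines might look something like this (paths simplified and made up for the example, not our real ones):

```markdown
# Adding a new API endpoint

Update these three files:

1. `backend/api/routes.ts` - register the route and point it at a handler.
2. `backend/api/handlers/<endpoint>.ts` - implement the handler and validate the input.
3. `common/api-types.ts` - add the request/response types so the client stays in sync.

After making the change, run the API tests before committing.
```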
you can put the information into knowledge.md or [description].knowledge.md, but sometimes I can't find documentation and we're both learning as we go lmao
Kind of like "please describe the solution and I will write code to do it". That's not how programming works. Writing code and testing it against expectations to get to the solution, that's programming.
Great question – we struggled for a long time to put our demo together precisely for this reason. Codebuff is so useful in a practical setting, but we can't bore the audience with a ton of background on a codebase when we do demos, so we have to pick a toy project. Maybe in the future, we could start our demo with a half-built project?
Hopefully the demo on our homepage shows a little bit more of your day-to-day workflows than other codegen tools show, but we're all ears on ways to improve this!
To give a concrete example of usefulness, I was implementing a referrals feature in Drizzle a few weeks ago, and Codebuff was able to build out the cli app, frontend, backend, and set up db schema (under my supervision, of course!) because of its deep understanding of our codebase. Building the feature properly requires knowing how our systems intersect with one another and the right abstraction at each point. I was able to bounce back and forth with it to build this out. It felt akin to working with a great junior engineer, tbh!
EDIT: another user shared their use cases here! https://news.ycombinator.com/item?id=42079914
I've been using Codebuff (formerly manicode) for a few weeks. I think they have nailed the editing paradigm and I'm using it multiple times a day.
If you want to make a multi-file edit in Cursor, you open Composer, probably have to click to start a new Composer session, type what you want, tell it which files it needs to include, watch it run through the change (seeing only an abbreviated version of the changes it makes), click apply all, then have to go and actually look at the real diff.
With Codebuff, you open it in the terminal and just type what you want, and it will scan the whole directory to figure out which files to include. Then you can see the whole diff. It's way cleaner and faster for making large changes. Because it can run terminal commands, it's also really good at cleaning up after itself, e.g., removing files, renaming files, installing dependencies, etc.
Both tools need work in terms of reliability, but the workflow with Codebuff is 10x better.
Does this send code via your servers? If so, why? Nothing you've described couldn't be better implemented as a local service.
Could this tool get a command from the LLM which would result in file-loss? How would you prevent that?
We already have AIDE, Continue, Cody, Aider, Cursor... Why this?
first time?
Noting Codebuff is manicode renamed.
It's become my go-to tool for handling fiddly refactors. Here’s an example session from a Rust project where I used it to break a single file into a module directory.
https://gist.github.com/cablehead/f235d61d3b646f2ec1794f656e...
Notice how it can run tests, see the compile error, and then iterate until the task is done? Really impressive.
For reference, this task used ~100 credits
Haha yes, here's the story of why we rebranded: https://manifold.markets/JamesGrugett/what-will-we-rename-ma...
Thanks for sharing! haxton was asking about practical use cases, I'll link them here!
Code-quality-wise, is it worse or better than Cursor? I pay for Cursor now and it saves me a LOT of time to not copy files around. I actually still use the ChatGPT/Claude interfaces to code as well.
Cool, it's probably about the same, since we're both using the new Sonnet 3.5 for coding.
We might have a bit of an advantage because we pull more files as context so the edit can be more in the style of your existing code.
One downside to us pulling more context is that we burn more tokens. That's partly why we have to charge $99 whereas Cursor is $20 per month.
If it's the same, it's hard to justify $100/month vs $20/month. I code mostly from vim, so I'm searching for my vim/CLI replacement while I still use both vim and Cursor.
Ah, but in Cursor you (mostly) have to manually choose files to edit and then approve all the changes.
With Codebuff, you just chat from the terminal. After trying it, I think you might not want to go back to Cursor haha.
That sounds cool and I like the idea, but definitely won't pay 5x. Maybe charge $30/month plus bring your own key. Let me know when you lower the price :)
It might sound small, but pulling in more context can make a huge difference – I remember one time Cursor completely hallucinated Prisma as part of our tech stack and created a whole new schema for us, whereas Codebuff knew we were already hooked up to Drizzle and just modified our existing schema. But like James said, we do use more tokens to do this, so pros & cons.
Allowing LLMs to execute unrestricted commands without human review is risky and insecure.
Yes, this is a good point. I think not asking to run commands is maybe the most controversial choice we've made so far.
The reason we don't ask for human review is simply: we've found that it works fine to not ask.
We've had a few hundred users so far, and usually people are skeptical of this at first, but as they use it they find that they don't want it to ask for every command. It enables cool use cases where Codebuff can iterate by running tests, seeing the error, attempting a fix, and running them again.
If you use source control like git, I also think that it's very hard for things to go wrong. Even if it ran rm -rf from your project directory, you should be able to undo that.
But here's the other thing: it won't do that. Claude is trained to be careful about this stuff and we've further prompted it to be careful.
I think not asking to run commands is the future of coding agents, so I hope you will at least entertain this idea. It's ok if you don't want to trust it, we're not asking you to do anything you are uncomfortable with.
I am not afraid of rm -rf of the whole directory. I am afraid of other stuff that it can do to my machines: leak my SSH keys, cookies, personal data, network devices, and make persistent modifications (malware) to my system. Or maybe inadvertently mess with my Python version, or globally install some library and mess up the whole system.
> it won't do that. Claude is trained to be careful about this stuff and we've further prompted it to be careful.
Could you please explain a bit how you are sure about it?
It's mainly from experience. From when I first set it up, I didn't have the feature to ask whether to run commands. It has been rawdogging commands this whole time and has never been a problem for me.
I think we have many other users who are similar. To be fair, sometimes after watching it install packages with npm, people are surprised and say that they would have preferred that it asked. But usually this is just the initial reaction. I'm pretty confident this is the way forward.
Do you have any sandbox-like restrictions in place to ensure that commands are limited to only touching the project folder and not other places in the system?
We always reset the directory back to the project directory on each command, so that helps.
But we're open to adding more restrictions so that it can't for example run `cd /usr && rm -rf .`
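For example, a simple guard could resolve any path-like arguments and refuse commands that escape the project root. A rough sketch of that idea, not something we ship today:

```typescript
import { resolve, sep } from "path";

// Hypothetical guard: reject a command if any absolute-looking or
// parent-traversing path argument resolves outside the project root.
function commandStaysInProject(command: string, projectRoot: string): boolean {
  const root = resolve(projectRoot) + sep;
  for (const token of command.split(/\s+/)) {
    if (token.startsWith("/") || token.includes("..")) {
      const target = resolve(projectRoot, token);
      if (!(target + sep).startsWith(root)) return false;
    }
  }
  return true;
}

// commandStaysInProject("rm -rf ./build", "/home/me/app")      -> true
// commandStaysInProject("cd /usr && rm -rf .", "/home/me/app") -> false
```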
How about executing commands in a VM (perhaps Firecracker)?
You are really missing out: https://github.com/e2b-dev/e2b
I don't see any sandbox usage in the demo video.
It's strange that all the closed models, whose stated reason for being closed is safety, allow this, while apps that allow erotic roleplay get banned all the time. Roleplay is significantly less dangerous than full shell control.
Been using the Cline extension in VS Code (which can execute commands and look at the output in the terminal) and it's an incredibly adept sysadmin, cloud architect, and data engineer. I like that Cline lets you approve/decline execution requests, and you can run it without sending the output, which is safer from a data perspective.
It's cool to have this natively on the remote system though. I think a safer approach would be to compile a small, multi-platform binary locally, which bundles the command plus the capture of output to relay back, and transmit that over ssh for execution (like how MGMT config management compiles Go to a static binary and sends it over to the remote node, vs. having to have mgmt and all its deps installed on every system it's managing).
Could be low lift vs. having a package, all its dependencies, and credentials running on the target system.
Are you an adept sysadmin, cloud architect, and/or data engineer?
It’s a weird catch-22 giving praise like that to LLMs.
If you are, then you might be able to intuit and fill in the gaps left by the LLM and not even know it.
And if you’re not, then how could you judge?
Not really much to do with what you were saying, really, just a thought I had.
I'd assume the person giving the praise is at least a bit of all 3.
> It’s a weird catch-22 giving praise like that to LLMs.
It's a bit asymmetrical though isn't it -- judging quality is in fact much easier than producing it.
> you might be able to intuit and fill in the gaps left by the LLM and not even know it
Just because you are able to fill gaps with it doesn't mean it's not good. With all of these tools you basically have to fill gaps. There are still differences between Cline vs Cursor vs Aider vs Codebuff.
Personally I've found Cline to be the best to date, followed by Cursor.
What if you have a microservice system with a repo-per-service setup, where to add functionality to a FE site you would have to edit code in three or four specific repos (FE site repo + backend service repo + API-client npm package repo + API gateway repo) out of hundreds of total repos?
Codebuff works on a local directory level, so it technically doesn't have to be a monorepo (though full disclaimer: our codebase is a monorepo and that's where we use it most). The most important thing is to make sure you have the projects in the same root directory so you can access them together. I've used it in a setup with two different repos in the same folder. That said, it might warn you that there's no .git folder at the root level when this happens.
This does seem to be suited to a monorepo.
Yes, unfortunately, Codebuff will only read files within one directory (and sub-directories).
If you have multiple repos, you could create a directory that contains them all, and that should work pretty well!
am I the only one who is scared of "it can run any command in your terminal"?
Hah no, you're not alone! Candidly, this is one of the top complaints from users. We're doing a lot of prompt engineering to be safe, but we can definitely do more. But the ones who take the leap of faith have found that it speeds up their workflows tremendously. Codebuff installs the right packages, sets up environments correctly, runs their scripts, etc. It feels magical because you can stay at a high level and focus on the real problems you're trying to solve.
If you're nervous about this, I'd suggest throwing Codebuff in a Docker container or even a separate instance with just your codebase.
Very excited for Codebuff, it's been a huge productivity boost for me! I've been putting it to use on a monorepo that has Go, TypeScript, Terraform, and some SQL, and it always looks at the right files for the task. I like the UX way better than Cursor's - I like reviewing all changes at once and making minor tweaks when necessary. Especially for writing Go, I love being able to stick with the GoLand IDE while using Codebuff.
Thanks for being one of our early users and dealing with our bugs! I love that we can fit into so many developers' workflows and support them where _they are_, as opposed to forcing them to use us for everything.
Why is there stuff for Manifold Markets in the distributed package?
/codebuff/dist/manifold-api.js
https://www.npmjs.com/package/codebuff?activeTab=code
Haha, it's because I used to work on Manifold Markets!
Codebuff was originally called Manicode. We just renamed it this week actually.
There was meant to be a universe of "Mani" products. My other cofounder made Manifund, and there's a conference we made called Manifest!
but that doesn’t answer the question? are these test files?
Oh, it was a tool call I originally implemented so that Codebuff could look up the probabilities of markets to help it answer user questions.
I thought it would be fun if you asked it about the chance of the election or maybe something about AI capabilities, it could back up the answer by citing a prediction market.
Why not simply remove that dependency now?
Cruft that built up over time. But you make a good point, I'll write a ticket to remove it soon.
Cause they wrote codebuff using codebuff.
They came out of Manifold. Though I acknowledge that doesn't really answer your question.
I've been using Codebuff for the last few weeks, and it's been really nice for working in my Elixir repo. And as someone who uses Neovim in the terminal instead of VS Code, it's nice to actually be able to have it live in the tmux split beside Neovim instead of having to switch to a different editor.
I have noticed some small oddities, like every now and then it will remove the existing contents of a module when adding a new function, but between a quick glance over the changes using the diff command and our standard CI suite, it's always pretty easy to catch and fix.
Thanks for using Codebuff! Yeah, these edit issues are annoying, but I'm confident we can reduce the error rate a lot in the coming weeks.
Love the demo video! Three quick questions:
Any specific reason to choose the terminal as the interface? Do you plan to make it more extensible in the future? (sounds like this could be wrapped with an extension for any IDE, which is exciting)
Also, do you see it being a problem that you can't point it to specific lines of code? In Cursor you can select some lines and CMD+K to instruct an edit. This takes away that fidelity, is it because you suspect models will get good enough to not require that level of handholding?
Do you plan to benchmark this with swe-bench etc.?
We thought about making a VSCode extension/fork like everyone else, but decided that the future is coding agents that do most of the work for you.
The terminal is actually a great interface because it is so simple. It keeps the product focused to not have complex UI options. But also, we rarely thought we needed any options. It's enough to let the user say what they want in chat.
You can't point to specific lines, but Codebuff is really good at finding the right spot.
I actually still use Cursor to edit individual files because I feel it is better when you are manually coding and want to change just one thing there.
We do plan to run SWE-bench. It's mostly the new Sonnet 3.5 under the hood making the edits, so it should do about as well as Anthropic's benchmark for that, which is really high, 49%: https://www.anthropic.com/news/3-5-models-and-computer-use
Fun fact is that the new Sonnet was given two tools to do code edits and run terminal commands to reach this high score. That's pretty much what Codebuff does.
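For anyone who hasn't seen Anthropic's tool-use API, those two tools might be declared roughly like this (my own guess at plausible definitions, not the exact ones from the benchmark or from Codebuff):

```typescript
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// Hypothetical tool definitions in the Messages API tool-use format.
const tools: Anthropic.Tool[] = [
  {
    name: "edit_file",
    description: "Create or overwrite a file in the project with new contents.",
    input_schema: {
      type: "object",
      properties: {
        path: { type: "string", description: "Path relative to the project root" },
        contents: { type: "string", description: "Full new contents of the file" },
      },
      required: ["path", "contents"],
    },
  },
  {
    name: "run_terminal_command",
    description: "Run a shell command in the project directory and return its output.",
    input_schema: {
      type: "object",
      properties: {
        command: { type: "string", description: "The command to execute" },
      },
      required: ["command"],
    },
  },
];

async function step(userRequest: string) {
  const response = await client.messages.create({
    model: "claude-3-5-sonnet-latest", // model alias is an assumption
    max_tokens: 4096,
    tools,
    messages: [{ role: "user", content: userRequest }],
  });
  // The reply contains tool_use blocks; the agent executes them, feeds the
  // results back as tool_result blocks, and loops until the task is done.
  return response.content.filter((block) => block.type === "tool_use");
}
```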
To add on, I know a lot of people see the terminal as cruft/legacy from the mainframe days. But it is a funny thing to look at tons of people's IDE setup and see that the one _consistent_ thing between them all is that they have a terminal nearby. It makes sense, too, since programs run in the terminal and you can only abstract so much to developers. And like James said, this sets us up nicely to build for a future of coding agents running around. Feels like a unique insight, but I dunno. I guess time will tell.
> I know a lot of people see the terminal as cruft/legacy from the mainframe days.
Hah. If you encounter people that think like this, run away because as soon as they finish telling you that terminals are stupid they inevitably want help configuring their GUI for k8s or git. After that, with or without a GUI, it turns out they also don’t understand version control / containers.
Are there any plans to add a sandbox? This seems cool, but it seems susceptible to prompt injection attacks when for example asking questions about a not necessarily trusted open source codebase.
We might! You could also set up your project within a docker container pretty simply (Codebuff would be great at setting that up :P).
Really like the look of this interface. You're definitely onto something. Good work.
The product design is really thoughtful and thanks for sharing your story – Cannot wait to try this see you and see how you iterate on this!
Extra context length looks valuable! Excited to try this out!
Amazing stuff! The rebrand is great and it's cool to read the whole story!
I've been playing with Codebuff for a few days (building out some services with Node.js + Typescript) - been working beautifully! Feels like I'm watching a skilled surgeon at work.
A skilled surgeon is a great analogy! We actually instruct Codebuff to focus on making the most minimal edits, so that it does precisely what you want.
whooooot! it's been a wild ride thus far, but we've been super thrilled at how people are using it and can't wait for you all to try it out!
we've seen our own productivity increase tenfold – using codebuff to buff our own code hah
let us know what you think!
Comparison with Aider?
Great question!
In Codebuff you don't have to manually specify any files. It finds the right ones for you! It also pulls more files to get you a better result. I think this makes a huge difference in the ergonomics of just chatting to get results.
Codebuff also will run commands directly, so you can ask it to write unit tests and run them as it goes to make sure they are working.
I think Aider does this to save tokens/money. It supports a lot of models so you can have Claude as your architect and another cheap model that does the coding.
> In Codebuff you don't have to manually specify any files.
Alright, I'm in.
Ah thanks, that's excellent! That is a massive issue for Aider; it was supposed to be solved, but last I tried I still had to do that manually.
Nice work!
That's why I like aider tbh. I know it's not going nuts on my repo.
> One user racked up a $500 bill by building out two Flutter apps in parallel.
Is that through the Enterprise plan?
Nope, if you go over the allotted credits on the $99 plan, then you pay per usage (with a 5% discount).
We actually ended up not charging this guy since there was a bug where we told him he got 50,000 credits instead of 10,000. Oops!
Can you speak more to how efficiency towards context management works (to reduce token costs)? Or are you loading up context to the brim with each request?
I think managing context is the most important aspect of today's coding agents. We pick only files we think would be relevant to the user request and add those. We generally pull more files than Cursor, which I think is an advantage.
However, we also try to leverage prompt-caching as much as possible to lower costs and improve latency.
So we basically only add files over time. Once context gets too large, it will purge them all and start again.
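A toy sketch of that policy (not our actual code) might look like:

```typescript
// Hypothetical sketch of an append-only file context that purges when it gets too big.
// Appending (rather than swapping files in and out) keeps the prompt prefix stable,
// which is what lets prompt caching hit on subsequent requests.
interface ContextFile { path: string; contents: string; }

class FileContext {
  private files = new Map<string, ContextFile>();
  constructor(private maxTokens = 150_000) {}

  private estimateTokens(text: string): number {
    return Math.ceil(text.length / 4); // rough heuristic: ~4 chars per token
  }

  private totalTokens(): number {
    let total = 0;
    for (const f of this.files.values()) total += this.estimateTokens(f.contents);
    return total;
  }

  add(file: ContextFile): void {
    if (this.files.has(file.path)) return; // only ever add, never reorder
    this.files.set(file.path, file);
    if (this.totalTokens() > this.maxTokens) {
      this.files.clear();              // purge everything...
      this.files.set(file.path, file); // ...and start again from the newest file
    }
  }

  toPrompt(): string {
    return [...this.files.values()]
      .map((f) => `--- ${f.path} ---\n${f.contents}`)
      .join("\n\n");
  }
}
```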
> However, we also try to leverage prompt-caching as much as possible to lower costs and improve latency.
Interesting! That does have a 5-minute expiry on Claude, and your users can use Codebuff in a suboptimal way. Do you have plans to nudge users toward using the tool in a way that makes the most of prompt caches?
Brilliant - and thank you - so impressed with your work, I finally made an account just to comment. It worked out of the box with a few minor glitches, but this is the start of awesome. Keep doing what you are doing.
Amazing, good to hear! What were the minor glitches you encountered? Would love to fix them up.
This is much needed! Gonna try this out. I haven't seen a good tool that lets me generate code via CLI.
The ergonomics of using unit tests + this to pass said unit tests is actually pretty good. Just tried it.
Does Codebuff / the tree sitter implementation support Svelte?
Yes, at least partially. It will work, but maybe not as well, since we don't parse out the function names from .svelte files.
I can add it if tree-sitter adds support for Svelte. I haven't checked; maybe it's already supported?
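In the meantime, a stopgap could be to pull the <script> block out of .svelte files and parse it with the JavaScript grammar. A rough, untested sketch:

```typescript
import Parser from "tree-sitter";
import JavaScript from "tree-sitter-javascript";

// Hypothetical sketch: extract the <script> block from a .svelte file and parse it
// with the JavaScript grammar to list top-level function names.
function svelteFunctionNames(source: string): string[] {
  const script = source.match(/<script[^>]*>([\s\S]*?)<\/script>/)?.[1] ?? "";
  const parser = new Parser();
  parser.setLanguage(JavaScript);
  const tree = parser.parse(script);
  return tree.rootNode
    .descendantsOfType("function_declaration")
    .map((node) => node.childForFieldName("name")?.text ?? "<anonymous>");
}
```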
I'm a big fan! It's better than cursor in many ways
Your comment is a bit suspicious given that your previous submissions are limited to manifold market links, and this tool came from that company.
Yes, he's the lead Manifold eng. Please discount appropriately.
Could you elaborate on those ways please?
Looks awesome! Great work team.
Couldn't get through the video, your keyboard sounds are very annoying.
Sorry! Will try an external keyboard next time!
Your website has a serious issue. Trying to play the YouTube video makes the page slow down to a crawl, even in 1080p, while playing it on YouTube directly has no issue, even in 4K.
On the project itself, I don't really find it exciting at all, I'm sorry. It's just another wrapper for a 3rd-party model, and the fact that you can 1) describe the entire workflow in 3 paragraphs and 2) build and launch it in around 4 months emphasizes that.
Congrats on launch I guess.
Weird, thanks for flagging – we're just using a Youtube embed in an iframe but I'll take a look.
No worries if this isn't a good fit for you. You're welcome to try it out for free anytime if you change your mind!
FWIW I wasn't super excited when James first showed me the project. I had tried so many AI code editors before, but never found them to be _actually usable_. So when James asked me to try, I just thought I'd be humoring him. Once I gave it a real shot, I found Codebuff to be great because of its form factor and deep context awareness: CLI allows for portability and system integration that plugins or extensions really can't do. And when AI actually understands my codebase, I just get a lot more done.
Not trying to convince you to change your mind, just sharing that I was in your shoes not too long ago!
I would really rethink your value proposition.
> CLI allows for portability and system integration that plugins or extensions really can't do
In the past 6 or 7 years I haven't written a single line of code outside of a JetBrains IDE. Same thing for all of my team (whether they use JetBrains IDEs or VS Code), and I imagine for the vast majority of developers.
This is not a convincing argument for the vast majority of people. If anything, the fact that it requires a tool OUTSIDE of where they write code is an inconvenience.
> And when AI actually understands my codebase, I just get a lot more done.
But Amazon Q does this without me needing to type anything to instruct it, or to tell it which files to look at. And, again, without needing to go out of my IDE.
Having to switch to a new tool to write code using AI is a huge deterrent and asking for it is a reckless choice for any company offering those tools. Integrating AI in tools already used to write code is how you win over the market.
> Your website has a serious issue.
I was thinking the same. My (admittedly old-ish) 2070 Super runs at 25-30% just looking at the landing page. Seems a bit crazy for a basic web page. I'm guessing it's the background animation.