I use Cursor and its tab completion; while what it can do is mind-blowing, in practice I'm not noticing a productivity boost.
I find that AI can help significantly with the plumbing, but it has no qualms about connecting the pipes wrong. I need to double- and triple-check the updated code, or fix the resulting errors when I don't. So: boilerplate and outer app layers, yes; architecture and core libraries, no.
Curious, is that a property of all AI-assisted tools for now? Or would Copilot, perhaps with its new models, offer a different experience?
I'm actually very curious why AI use is such a bimodal experience. I've used AI to move multi-thousand-line codebases between languages. I've created new apps from scratch with it.
My theory is that it comes down to the willingness to babysit, and the modality. I'm perfectly fine pointing out the tool's errors to it and working side by side with it like it was another person. At the end of the day, it can belt out lines of code faster than I, or any human, can, and I can review code very quickly, so the overall productivity boost has been great.
It does fundamentally alter my workflow. I'm very hands-off-keyboard when I'm working with AI, in a way that is much more like working with someone or coaching someone to make something instead of doing the making myself. I'm fine with that, but I recognize many developers aren't.
I use AI autocomplete 0% of the time, as I found that workflow was not as effective as just writing the code myself. Most of my successful work with AI is a chat dialogue where I let it build large swaths of the project, a file or parts of a file at a time, with me reviewing and coaching.
As a programmer of over 20 years - this is terrifying.
I'm willing to accept that I just have "get off my lawn" syndrome or something.
But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible. Whenever I sit down to write some code, be it a large implementation or a small function, I think about what other people (or future versions of myself) will struggle with when interacting with the code. Is it clear and concise? Is it too clever? Is it too easy to write a subtle bug when making changes? Have I made it totally clear that X is relying on Y dangerous behavior by adding a comment or intentionally making it visible in some other way?
It goes the other way too. If I know someone well (or their style) then it makes evaluating their code easier. The more time I spend in a codebase the better idea I have of what the writer was trying to do. I remember spending a lot of time reading the early Redis codebase and got a pretty good sense of how Salvatore thinks. Or altering my approaches to code reviews depending on which coworker was submitting it. These weren't things I was doing out of desire but because all non-trivial code has so much subtlety; it's just the nature of the beast.
So the thought of opening up a codebase that was cobbled together by an AI is just scary to me. Subtle bugs and errors would be equally distributed across the whole thing instead of where the writer was less competent (as is often the case). The whole thing just sounds like a gargantuan mess.
> The more time I spend in a codebase the better idea I have of what the writer was trying to do.
This whole thing of using LLMs to code reminds me a bit of when Google Translate came out and became popular, right around the time I started studying Russian.
Yes, copying and pasting a block of Russian text produced a block of English text from which you could get a general idea of what was happening. But translating from English to Russian rarely worked well enough to fool the professor, because of all the idioms, style, etc. Russian has a lot of ways to write "compactly," with fewer words than English and a much more precise meaning in the sentence. (I always likened Russian to type-safe Haskell and English to dynamic Python.)
If you actually understood Russian and read the text, you could uncover much deeper and subtler meaning and connections that get lost in translation.
If you went to Russia today, you could get around with Google Translate and people would understand you. But you aren't going to be having anything other than surface-level requests and responses.
Coding with LLMs reminds me a lot of this. Yes, they produce something that the computer understands and runs, but the meaning and intention of what you wanted to communicate gets lost through this translation layer.
Coding is even worse, because I feel like the intention of coding should never be to belt out as many lines as possible. Coding has powerful abstractions that you can use to minimize the lines you write and crystallize meaning and intent.
This presumes that it will be real humans that have to “take care” of the code later.
A lot of the people that are hawking AI, especially in management, are chasing a future where there are no humans, because AI writes the code and maintains the code, no pesky expensive humans needed. And AI won’t object to things like bad code style or low quality code.
I think this is a bit short sighted, but I’m not sure how short. I suspect in the future, code will be something in between what it is today, and a build artifact. Do you have to maintain bytecode?
I was helping translate some UI text for a website from English to German, my mother tongue. I found that usually the machine came up with better translations than me.
I am a professional translator, and I have been using LLMs to speed up and, yes, improve my translations for a year and a half.
When properly prompted, the LLMs produce reasonably accurate and natural translations, but sometimes there are mistakes (often the result of ambiguities in the source text) or the sentences don’t flow together as smoothly as I would like. So I check and polish the translations sentence by sentence. While I’m doing that, I sometimes encounter a word or phrase that just doesn’t sound right to me but that I can’t think how to fix. In those cases, I give the LLMs the original and draft translation and ask for ten variations of the problematic sentence. Most of the suggestions wouldn’t work well, but there are usually two or three that I like and that are better than what I could come up with on my own.
Lately I have also been using LLMs as editors: I feed one the entire source text and the draft translation, and I ask for suggestions for corrections and improvements to the translation. I adopt the suggestions I like, and then I run the revised translation through another LLM with the same prompt. After five or six iterations, I do a final read-through of the translation to make sure everything is okay.
My guess is that using LLMs like this cuts my total translation time by close to half while raising the quality of the finished product by some significant but difficult-to-quantify amount.
This process became feasible only after ChatGPT, Claude, and Gemini got longer context windows. Each new model release has performed better than the previous one, too. I’ve also tried open-weight models, but they were significantly worse for Japanese to English, the direction I translate.
Although I am not a software developer, I’ve been following the debates on HN about whether or not LLMs are useful as coding assistants with much interest. My guess is that the disagreements are due partly to the different work situations of the people on both sides of the issue. But I also wonder if some of those who reject AI assistance just haven’t been able to find a suitable interactive workflow for using it.
This is a great analogy. I find myself thinking that by abstracting the entire design process when coding something using generative AI tools, you tend to lose track of fine details by only concentrating on the overall function.
Maybe the code works, but does it integrate well with the rest of the codebase? Do the data structures that it created follow the overall design principles of your application? For example, does it make the right tradeoffs between time and space complexity for this application? For certain applications, memory may be an issue, and while the code may work, it uses too much memory to be useful in practice.
These are the kind of problems that I think about, and it aligns with your analogy. There is in fact something "lost through this translation layer".
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible
I heard a similar thing from a dude when I said I use it for bash scripts instead of copying and pasting things off StackOverflow.
He was a bit "get off my lawny" about the idea of running any code you didn't write, especially bash scripts in a terminal.
It is obviously the case that I didn't write most of the code in the world, by a very large margin. But even without taking it to extremes: if I'm working on a team and other people are writing code, how is that any different? Everyone makes mistakes; I make mistakes.
I think it's a bad idea to run things when you don't at least understand what they're going to do, but the speed with which ChatGPT can produce, for example, gcloud shell commands to manage resources is lightning fast (all of which is very readable; it just takes a while if you want to look the commands up and compose them yourself).
If your quality control method is "making sure there are no mistakes" then it's already broken regardless of where the code comes from. Me reviewing AI code is no different from me reviewing anyone else's code.
Me testing AI code using unit or integration tests is no different from testing anyone else's code, or my own code for that matter.
Multiple times in my s/w development career, I've had supervisors ask me why I am not typing code throughout the work day.
My response each time was along the lines of:
When I write code, it is to reify the part of a solution which
I understand. This includes writing tests to certify same.
There is no reason to do so before then.
> Me reviewing AI code is no different from me reviewing anyone else's code.
I take your point, and on the whole I agree with your post, but this point is fundamentally _not_ correct, in that if I have a question about someone else's code I can ask them about their intention, state-of-mind, and understanding at the time they wrote it, and (subjectively, sure; but I think this is a reasonable claim) can _usually_ detect pretty well if they are bullshitting me when they respond. Asking AI for explanations tends to lead to extremely convincing and confident false justifications rather than an admission of error or doubt.
However:
> Me testing AI code using unit or integration tests is no different from testing anyone else's code, or my own code for that matter.
I think it depends on the stakes of what you're building.
A lot of the concerns you describe make me think you work in a larger company or team and so both the organizational stakes (maintenance, future changes, tech debt, other people taking it over) and the functional stakes (bug free, performant, secure, etc) are high?
If the person you're responding to is cranking out a personal SaaS project or something they won't ever want to maintain much, then they can do different math on risks.
And probably also the language you're using, and the actual code itself.
Porting a multi-thousand line web SaaS product in Typescript that's just CRUD operations and cranking out web views? Sure why not.
Porting a multi-thousand line game codebase that's performance-critical and written in C++? Probably not.
That said, I am super fascinated by the approach of "let the LLM write the code and coach it when it gets it wrong" and I feel like I want to try it... but probably not on a work project, and maybe just on a personal project.
> Porting a multi-thousand line web SaaS product in Typescript that's just CRUD operations and cranking out web views? Sure why not.
>
> Porting a multi-thousand line game codebase that's performance-critical and written in C++? Probably not.
From my own experience:
I really enjoy CoPilot to support me writing a terraform provider. I think this works well because we have hundreds of existing terraform providers with the same boilerplate and the same REST-handling already. Here, the LLM can crank out oodles and oodles of identical boilerplate that's easy to review and deal with. Huge productivity boost. Maybe we should have better frameworks and languages for this, but alas...
I've also tried using CoPilot on a personal Godot project. I turned it off after a day, because it was so distracting with nonsense. Thinking about it along these lines, I would not be surprised if this occurred because the high-level code of games (think what AAA games do in Lua, and what Godot does in GDScript) tends to be small-volume and rather erratic. Here there is no real pattern to follow.
This could also be a cause for the huge difference in LLM productivity boosts people report. If you need Spring Boot code to put query params into an ORM and turn that into JSON, it can probably do that. If you need embedded C code for an obscure microcontroller... yeah, good luck.
> If you need embedded C code for an obscure microcontroller... yeah, good luck.
... or even information in the embedded world. LLMs need to generate something, so they'll generate code even when the answer is "no dude, your chip doesn't support that".
> they'll generate code even when the answer is "no dude, your chip doesn't support that".
This is precisely the problem. As I point out elsewhere[0], reviewing AI-generated code is _not_ the same thing as reviewing code written by someone else, because you can ask a human author what they were thinking and get a moderately-honest response; whereas an AI will confidently and convincingly lie to you.
I am quite interested in how LLMs would handle game development. Coming to game development from a long career in boutique applications and also enterprise software, game development is a whole different level of "boutique".
I think both because of the coupled, convoluted complexity of much game logic, and because there are fewer open source examples of novel game code available to train on, they may struggle to be as useful.
> A lot of the concerns you describe make me think you work in a larger company or team and so both the organizational stakes (maintenance, future changes, tech debt, other people taking it over) and the functional stakes (bug free, performant, secure, etc) are high?
The most financially rewarding project I worked on started out as an early stage startup with small ambitions. It ended up growing and succeeding far beyond expectations.
It was a small codebase but the stakes were still very high. We were all pretty experienced going into it so we each had preferences for which footguns to avoid. For example we shied away from ORMs because they're the kind of dependency that could get you stuck in mud. Pick a "bad" ORM, spend months piling code on top of it, and then find out that you're spending more time fighting it than being productive. But now you don't have the time to untangle yourself from that dependency. Worst of all, at least in our experience, it's impossible to really predict how likely you are to get "stuck" this way with a large dependency. So the judgement call was to avoid major dependencies like this unless we absolutely had to.
I attribute the success of our project to literally thousands of minor and major decisions like that one.
To me almost all software is high stakes. Unless it's so trivial that nothing about it matters at all; but that's not what these AI tools are marketing toward, are they?
Something might start out as a small useful library and grow into a dependency that hundreds of thousands of people use.
So that's why it terrifies me. I'm terrified of one day joining a team or wanting to contribute to an OSS project - only to be faced with thousands of lines of nonsensical autogenerated LLM code. If nothing else it takes all the joy out of programming computers (although I think there's a more existential risk here). If it was a team I'd probably just quit on the spot but I have that luxury and probably would have caught it during due diligence. If it's an OSS project I'd nope out and not contribute.
Adding here due to some resonance with this point of view: this exchange lacks a crucial axis. What kind of programming?
I assume the parent post is saying "I ported thousands of lines of <some C-family language executing on a server> to <Python on standard cloud environments>." I could be very wrong, but that is my guess. Like any data-driven software machinery, there is massive inherent bias and extra resources for <current in-demand thing>; in this guessed case, that is Python running on a standard cloud environment, with the loaders and credentials parts too, perhaps.
Those who learned programming in the theoretic ways know that many, many software systems are possible in various compute contexts. And those working on hardware teams know that there are a lot of kinds of computing hardware. And to add another off-the-cuff idea, there is so much web-interface code a la 2004 to bring to newer, cleaner setups.
I am not <emotional state descriptor> about this sea change in code generation; in fact, code generation is not at all new. It is the blatant stealing and LICENSE washing of a generation of OSS that gets me. Those code generation machines are repeating their inputs. No authors agreed, and no one asked them, either.
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible.
I do think it is kind of crazy based on what I've seen. I'm convinced LLMs are a game changer, but I couldn't believe how stupid they can be. Take the following example, which is a spelling and grammar checker that I wrote:
If you click on the sentence, you can see that Claude-3.5 and GPT-4o cannot tell that GitHub is spelled correctly most of the time. It was this example that made me realize how dangerous LLMs can be. The sentence is short, but Claude-3.5 and GPT-4o just can't process it properly.
Having an LLM rewrite large swaths of code is crazy, but I believe that with proper tooling to verify and challenge changes, we can mitigate the risk.
I'm just speculating, but I believe GitHub has come to the same conclusion that I have, which is, all models can be stupid, but it is unlikely that all will be stupid at the same time.
AIs are not able to write Redis. That's not their job. AIs should not write complex high performance code that millions of users rely on. If the code does something valuable for a large number of people you can afford humans to write it.
AIs should write low value code that just repeats what's been done before but with some variations. Generic parts of CRUD apps, some fraction of typical frontends, common CI setups.
That's what they're good at because they've seen it a million times already. That category constitutes most code written.
This relieves human developers of ballpark 20% of their workload and that's already worth a lot of money.
Not the parent but this doesn’t seem mind changing, because what you describe is the normal/boring route to slightly better productivity using new tools without the breathless hype. And the 20% increase you mention of course depends a lot on what you’re doing, so for many types of work you’d be much closer to zero.
I’m curious about the claims of “power users” who are talking very excitedly about a brave new world. Are they fooling themselves, or trying to fool others, or working at jobs where 90% of their work is boilerplate drudgery, or what exactly? Inevitably it’s all of the above... and some small percentage of real power users who could probably teach the rest of us cool stuff about their unique workflows. Not sure how to find the signal in all the noise though.
So personally, if I were to write “change my mind”, what I’d really mean is something like “convince me there are real power users already out there in the wild, using tools that are open to the public today”.
GP mentioned machine-assisted translation of a huge code base being almost completely hands-off. If that were true and as easy as advertised, then one might expect, for example, that it would be trivial to just rewrite MediaWiki or WordPress in Rails or Django with a few people in a week. This is on the easier side of what I’d confidently label as a game-changingly huge productivity boost, btw, and is a soft problem chosen because of the availability of existing code examples, mere translation over original work, etc. Not sure we’re there yet.
I can definitely see the value in letting AI generate low stakes code. I'm a daily CoPilot user and, while I don't let it generate implementations, the suggestions it gives for boilerplate-y things is top notch. Love it as a tool.
My major issue with your position is that, at least in my experience, good software is the sum of even the seemingly low risk parts. When I think of real world software that people rely on (the only type I care about in this context) then it's hard to point a finger at some part of it and go "eh, this part doesn't matter". It all matters.
The alternative, I fear, is 90% of the software we use exhibiting subtle goofy behavior and just being overall unpleasant to use.
I guess an analogy for my concern is what it would look like if 60% of every film was AI generated using the models we have today. Some might argue that 60% of all films are low stakes scenes with simple exposition or whatever. And the remaining 40% are the climax or other important moments. But many people believe that 100% of the film matters, even the opening credits.
And even if none of that were an issue: in my experience it's very difficult to assess what part of an application will/won't be low/high stakes. Imagine being a tech startup that needs to pivot your focus toward the low stakes part of the application that the LLM wrote.
I think your concept of ‘what the AI wrote’ is too large. There is zero chance my one line copilot or three line cursor tab completions are going to have an effect on the overall quality of my codebase.
What it is useful for is doing exactly the things I already know need to happen, but don’t want to spend the effort to write out (at least, not having to do it is great).
Since my brain and focus aren’t killed by writing crud, I get to spend that on more useful stuff. If it doesn’t make me more effective, at least it makes my job more enjoyable.
I'm with you. I use Copilot every day in the way you're describing and I love it. The person I was responding to is claiming to code "hands off" and let the AI write the majority of the software.
I have to disagree. If there’s that much boilerplate floating around then the tooling should be improved. Pasting over inefficiency with sloppier inefficiency is just a pure waste.
You still use type systems, tests, and code review.
For a lot of use cases it's powerful.
If you ask it to build out a brand new system with a complex algorithm or to perform a more complex refactoring, it'll be more work correcting it than doing it yourself.
But that malformed JSON document with the weird missing quotation marks (so the usual formatters break), and spaces before commas, and the indentation is wild... Give it to an LLM.
Or when you're writing content impls for a game based on a list of text descriptions, copy the text into a block comment. Then impl 1 example. Then just sit back and press tab and watch your profits.
The (mostly useless boilerplate “I’m basically just testing my mocks”) tests are being written by AI too these days.
Which is mildly annoying as a lot of those tests are basically just noise rather than useful tools. Humans have the same problem, but current models are especially prone to it from what I’ve observed
And not enough devs are babysitting the AI to make sure the test cases are useful, even if they’re doing so for the original code it produced
There are very few tutorials on how to do testing and I don't think I have ever seen one that was great. Compared to general coding stuff where there's great tutorials available for all the most common things.
So I think quality testing is just not in the training data at anywhere close to the quantity needed.
> Whenever I sit down to write some code, be it a large implementation or a small function, I think about what other people (or future versions of myself) will struggle with when interacting with the code. Is it clear and concise? Is it too clever? Is it too easy to write a subtle bug when making changes? Have I made it totally clear that X is relying on Y dangerous behavior by adding a comment or intentionally making it visible in some other way?
> It goes the other way too. If I know someone well (or their style) then it makes evaluating their code easier. The more time I spend in a codebase the better idea I have of what the writer was trying to do.
What I believe you are describing is a general definition of "understanding", which I am sure you are aware. And given your 20+ year experience, your summary of:
> So the thought of opening up a codebase that was cobbled together by an AI is just scary to me. Subtle bugs and errors would be equally distributed across the whole thing instead of where the writer was less competent (as is often the case).
Is not only entirely understandable (pardon the pun), but to be expected, as the algorithms employed lack the crucial bit which you identify: understanding.
> The whole thing just sounds like a gargantuan mess.
As it does to most who envision having to live with artifacts produced by a statistical predictive-text algorithm.
> Change my mind.
One cannot, because understanding, as people know it, is intrinsic to each person by definition. It exists as a concept within the person who possesses it and is defined entirely by said person.
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible.
I think this is where the bimodality comes from. When someone says "I used AI to refactor 3000 loc", some take it to mean they used AI in small steps as an accelerator, and others take it to mean a direct copy/paste, fix compile errors, and move on.
Treat AI like a mid-level engineer that you are pair programming with, who can type insanely fast. Move in small steps. Read through its code after each small iteration. Ask it to fix things (or fix them yourself if quick and easy). Brainstorm ideas with it, etc.
Unit, integration, e2e, types and linters would catch most of the things you mention.
Not every software is mission critical, often the most important thing is to go as fast and possible and iterate very quickly. Good enough is better than very good in many cases.
I have 10 years professional experience and I've been writing code for 20 years, really with this workflow I just read and review significantly more code and I coach it when it structures or styles something in a way I don't like.
I'm fully in control, and nothing gets committed that I haven't read; it's an extension of me at that point.
Edit: I think the issues you've mentioned typically apply to people too and the answer is largely the same. Talk, coach, put hard fixes in like linting and review approvals.
> As a programmer of over 20 years - this is terrifying.
>
> I'm willing to accept that I just have "get off my lawn" syndrome or something.
>
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible.
My first thought was that I disagree (though I don't use or like this in-IDE AI stuff) because of version control. But then the way people use (or can't use) version control 'terrifies' me anyway, so maybe I agree? It would be fine if handled correctly, but it won't be, sort of thing.
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible.
People felt the same about compilers for a long time. And justifiably so: the idea that compilers are reliable is quite a new one, and finding compiler bugs used to be pretty common. (Those experimenting with newer languages still get to enjoy the fun of this!)
How about other code generation tools? Presumably you don't take much umbrage with schema generators? Or code generators that take a schema and output library code (OpenAPI, Protocol Buffers, or even COM)? Those can easily take a few dozen lines of input and output many thousands of LoC, and because they are part of an automated pipeline, even if you do want to fix the code up, any fixes you make will be destroyed on the next pipeline run!
But there is also a LOT of boring boilerplate code that can be automated.
For example, the necessary code to create a new server, attach a JSON schema to a POST endpoint, validate a bearer token, and enable a given CORS config is pretty cut and dry.
If I am ramping up on a new backend framework, I can either spend hours learning the above and then copy and paste it forever more into each new project I start up, or I can use an AI to crap the code out for me.
(Actually, once when I was setting up a new server I decided not to just copy and paste but to do it myself; I flipped the order of two `use` directives and it cost me at least 4 hours to figure out WTF was wrong...)
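As a concrete illustration of that kind of boilerplate, here is a minimal sketch assuming FastAPI; the framework choice, route, allowed origin, and token check are all illustrative assumptions rather than anything from the comment:

```python
# Hypothetical sketch: a new server with a schema-validated POST endpoint,
# a bearer-token check, and a CORS config. The framework (FastAPI) is assumed.
from fastapi import Depends, FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from pydantic import BaseModel

app = FastAPI()

# CORS config: which origins may call this API (origin is illustrative).
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://example.com"],
    allow_methods=["POST"],
    allow_headers=["Authorization", "Content-Type"],
)

bearer = HTTPBearer()


class Item(BaseModel):
    """JSON schema attached to the POST endpoint."""
    name: str
    quantity: int


def check_token(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> None:
    # Placeholder validation; a real app would verify a JWT or look the token up.
    if creds.credentials != "expected-token":
        raise HTTPException(status_code=401, detail="invalid bearer token")


@app.post("/items")
def create_item(item: Item, _: None = Depends(check_token)) -> dict:
    return {"ok": True, "name": item.name}
```

It is exactly this sort of easy-to-review, tedious-to-retype setup code (middleware ordering included) where generation tends to pay off.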
> As a programmer of over 20 years
I'm almost up there, and my view is that I have two modes of working:
1. Super low level, where my intimate knowledge of algorithms, the language and framework I'm using, of CPU and memory constraints, all come together to let me write code that is damn near magical.
2. Super high level, where I am architecting a solution using design patterns and the individual pieces of code are functionally very simple, and it is how they are connected together that really matters.
For #1, eh, for some popular problems AI can help (popular optimizations on Stack Overflow).
For #2, AI is the most useful, because I have already broken the problem down into individual bite size testable nuggets. I can have the AI write a lot of the boilerplate, and then integrate the code within the larger, human architected, system.
> So the thought of opening up a codebase that was cobbled together by an AI is just scary to me.
The AI didn't cobble together the system. The AI did stuff like "go through this array and check the ID field of each object and if more than 3 of them are null log an error, increment the ExcessNullsEncountered metric counter, and return an HTTP 400 error to the caller"
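An instruction like that maps almost line-for-line onto code. A rough sketch of what it might produce; the `metrics` helper and the function signature are assumptions for illustration, not anything from the original comment:

```python
# Sketch of the described check; `metrics` stands in for the caller's existing
# observability helper and is not a real library API.
import logging

logger = logging.getLogger(__name__)

MAX_NULL_IDS = 3


def validate_ids(objects: list[dict], metrics) -> int | None:
    """Return 400 if more than 3 objects have a null ID field, else None."""
    null_ids = sum(1 for obj in objects if obj.get("id") is None)
    if null_ids > MAX_NULL_IDS:
        logger.error("too many null IDs: %d", null_ids)
        metrics.increment("ExcessNullsEncountered")
        return 400  # the caller translates this into an HTTP 400 response
    return None
```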
Edit: This just happened
I am writing a small Canvas game renderer, and I was having an issue where text above a character's head renders off the canvas. So I had Cursor fix the function up to move text under a character if it would have been rendered above the canvas area.
I was able to write the instructions out to Cursor faster than I could have found a pencil and paper to sketch out what I needed to do.
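A sketch of the clamping logic being described, written as generic Python rather than the actual renderer code; the y-down coordinate convention (y = 0 at the top edge) is an assumption:

```python
def label_y(char_y: float, char_height: float, text_height: float, margin: float = 4.0) -> float:
    """Place a label above the character, or below it if it would leave the canvas.

    Assumes y grows downward and y = 0 is the top edge of the canvas.
    """
    above = char_y - margin - text_height
    if above < 0:  # the label would render off the top of the canvas
        return char_y + char_height + margin  # draw it under the character instead
    return above
```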
How do tests account for cases where I'm looking at a 100 line function that could have easily been written in 20 lines with just as much, if not more, clarity?
It reminds me of a time (long ago) when the trend/fad was building applications visually. You would drag and drop UI elements and define logic using GUIs. Behind the scenes the IDE would generate code that linked everything together. One of the selling points was that underneath the hood it's just code so if someone didn't have access to the IDE (or whatever) then they could just open the source and make edits themselves.
It obviously didn't work out. But not because of the scope/scale (something AI code generation solves) but because, it turns out, writing maintainable secure software takes a lot of careful thought.
I'm not talking about asking an AI to vomit out a CRUD UI. For that I'm sure it's well suited and the risk is pretty low. But as soon as you introduce domain specific logic or non-trivial things connected to the real world - it requires thought. Often times you need to spend more time thinking about the problem than writing the code.
I just don't see how "guidance" of an LLM gets anywhere near writing good software outside of trivial stuff.
> How do tests account for cases where I'm looking at a 100 line function that could have easily been written in 20 lines with just as much, if not more, clarity?
That’s not a failure of the AI writing that 100 line monstrosity, it’s a failure of you deciding to actually use the thing.
If you know what 20 lines are necessary and the AI doesn’t output that, why would you use it?
> How do tests account for cases where I'm looking at a 100 line function that could have easily been written in 20 lines with just as much, if not more, clarity?
If the function is fast to evaluate and you have thorough test coverage, you could iterate with an LLM that aims to compress it down to a simpler or shorter version that behaves identically to the original function. Of course, brevity for the sake of brevity can produce less code that is not always clearer or simpler to understand than the original. LLMs are very good at mimicking code style, so show them a lot of your own code and ask them to mimic it, and you may be surprised.
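A sketch of that iteration loop; `ask_llm_for_shorter_version` is a hypothetical stand-in for whatever model call you use, and pytest as the test runner is an assumption:

```python
import subprocess
from pathlib import Path


def ask_llm_for_shorter_version(source: str) -> str:
    """Hypothetical helper: prompt a model for a shorter, equivalent version."""
    raise NotImplementedError


def tests_pass() -> bool:
    # Any fast test runner works; pytest is assumed here.
    return subprocess.run(["pytest", "-q"], capture_output=True).returncode == 0


def compress_function(path: str, max_attempts: int = 5) -> None:
    target = Path(path)
    best = target.read_text()
    for _ in range(max_attempts):
        candidate = ask_llm_for_shorter_version(best)
        if len(candidate) >= len(best):
            continue  # not actually shorter; try again
        target.write_text(candidate)
        if tests_pass():
            best = candidate  # keep the shorter, still-passing version
        else:
            target.write_text(best)  # behaviour changed; revert
```

The test suite is doing the real work here: it is the gate that keeps "shorter" from quietly becoming "different".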
Finally found a comment down here that I like. I'm also with the notion of tests and also iterating until you get to a solution you like. I also don't see anything particularly "terrifying" that many other comments suggest.
At the end of the day, we're engineers that write complex symbols on a 2d canvas, for something that is (ultimately, even if the code being written is machine to machine or something) used for some human purpose.
Now, if those complex symbols are readable, fully covered in tests, and meets requirements / specifications, I don't see why I should care if a human, an AI, or a monkey generated those symbols. If it meets the spec, it meets the spec.
Seems like most people in these threads are making arguments against others who are describing usage of these tools in a grossly incorrect manner from the get go.
I've said it before in other AI threads: I think (at least half?) of the noise and disagreement around AI-generated code is like a bunch of people trying to use a hammer when they needed a screwdriver and then complaining that the hammer didn't work like a screwdriver! I just don't get it. When you're dealing with complex systems, i.e., reality, these tools (or any tool for that matter) will never work like a magic wand.
Depending on whether I'm using LLMs from my Emacs or via a tool like Aider, I either review and manually merge offered modifications as diffs (in editor), or review the automatically generated commits (Aider). Either way, I end up reading a lot of diffs and massaging the LLM output on the fly, and nothing that I haven't reviewed gets pushed to upstream.
I mean, people aren't seriously pushing unreviewed LLM-generated code to production? Current models aren't good enough for that.
The most likely explanation is that the code you are writing has low information density and is stringing things together the same way many existing apps have already done.
That isn’t a judgement, but trying to use the AI code completion tools for complex systems tasks is almost always a disaster.
I'm not sure how many people are like me, but my attempts to use Copilot have largely been the context of writing code as usual, occasionally getting end-of-line or handful-of-lines completions from it.
I suspect there's probably a bigger shift needed, but I haven't seen anyone (besides AI "influencers" I don't trust..?) showing what their day-to-day workflows look like.
Is there a Vimcasts equivalent for learning the AI editor tips and tricks?
The autocomplete is somewhere between annoying and underwhelming for me, but the chat is super useful. Being able to just describe what you're thinking or what you're trying to do and having a bespoke code sample just show up (based on the code in your editor) that you can then either copy/paste in, cherry-pick from or just get inspired by, has been a great productivity booster..
Treat it like a pair programmer or a rubber duck and you might have a better experience. I did!
> I'm actually very curious why AI use is such a bi-modal experience
I think it's just that it's better at some things than others. Lucky for people who happen to be working in python/node/php/bash/sql/java; probably unlucky for people writing Go and Rust. (I'm hypothesising, because I don't know Go or Rust, nor have I ever used them, but when the AI doesn't know something it REALLY doesn't know it; it goes from being insanely useful to utterly useless.)
> I use AI autocomplete 0% of the time as I found that workflow was not as effective as me just writing code, but most of my most successful work using AI is a chat dialogue where I'm letting it build large swaths of the project a file or parts of a file at a time, with me reviewing and coaching.
Me too, the way I use it is more like pair programming.
> I'm perfectly fine telling the tool I use its errors and working side by side with it like it was another person.
This is key. Traditional computing systems are deterministic machines, but AI is a probabilistic machine. So the way you interact and the range, precision, and perspective of the output stretches over a different problem/solution space.
Interesting that you find the conversational approach effective. For me, I'd say 9 out of 10 code conversations get stuck in a loop, with me telling the AI that the next suggested iteration didn't actually change anything or changed it back to something that was already broken. Do you not experience that so often, or do you have a way to escape it?
If you're doing something that appears in its training data a lot, like building a Twitter clone, then it is great. If you're using something brand new like React Router 7, then it makes mistakes.
My theory is grammatical correctness and specificity. I see a lot of people prompt like this:
"use python to write me a prog that does some dice rolls and makes a graph"
Vs
"Create a Python program that generates random numbers to simulate a series of dice rolls. Export a graph of the results in PNG format."
Information theory requires that you provide enough actual information. There is a minimum amount of work to supply the input. Otherwise, the gaps get filled in with noise, which may or may not be what you wanted.
For example, maybe someday you could say "write me an OS" and it would work. However, to get exactly what you want, you still have to specify it. You can only compress so far.
I agree. I am in a very senior role and find that working with AI the same way you do I am many times more productive. Months of work becomes days or even hours of work
Have you tried using ChatGPT etc. as a starting point when you're unfamiliar with something? That's where it really excels for me; I can go crazy fast from 0 to ~30 (if we call 60 MVP). For example, the other day I was trying to stream some PCM audio using WebAudio, and it spit out a mostly functional prototype for me in a few minutes of trying. For me to read through MDN and get to that point would've taken an hour or two, and going from the crappy prototype as a starting point to read up on WebAudio let me get an MVP in ~15 mins. I rarely touch frontend web code, so for me these tools are super helpful.
On the other hand, I find it just wastes my time on more typical tasks like implementing business logic in a familiar language, because it makes up stdlib APIs too often.
This is about the only use case I found it helpful for - saving me time in research, not in coding.
I needed to compare compression ratios of a certain text in a language, and it actually came up with something nice and almost workable. It didn't compile but I forgot why now, I just remember it needing a small tweak. That saved me having to track down the libraries, their APIs, etc.
However, when it comes to actually doing data structures or logic, I find it quicker to just do it myself than to type out what I want to do, and double check its work.
That's a very important caveat. In our modern economy it's difficult to not be a shill in some way, shape, or form, even if you don't quite realize it consciously. It's honestly one of the most depressing things about the stock market.
Holding stock is not being a "happy customer". I may be happy with the headset that I bought, but the difference is that I don't make money if you buy an identical one.
I wasnt talking about holding stock, I was responding to this comment you said:
> In our modern economy it's difficult to not be a shill in some way, shape, or form, even if you don't quite realize it consciously.
Oxford dictionary defines a shill as
"an accomplice of a confidence trickster or swindler who poses as a genuine customer to entice or encourage others."
So the difference between someone shilling and being a satisfied customer is an intent to deceive. How is it "difficult to not pose as a genuine customer to entice or encourage others"?
I credit my past interest in cryptocurrencies for educating me about the essence of the stock market in its purest form. And in fact there are painful parallels with the AI bubble.
It's the subtle errors that are really difficult to navigate. I got burned for about 40 hours on a conditional being backward in the middle of an otherwise flawless method.
The apparent speed up is mostly a deception. It definitely helps with rough outlines and approaches. But, the faster you go, the less you will notice the fine details, and the more assumptions you will accumulate before realizing the fundamental error.
I'd rather find out I was wrong within the same day. I'd probably have written some unit tests and played around with that function a lot more if I had handcrafted it.
When I am able to ask an LLM a very simple question, which prevents me from having to context-switch to answer the same simple question myself, it is a big time-saver for me, though one that is hard to quantify.
Anything that reduces my cognitive load when the pressure is on is a blessing on some level.
You could make the same argument for any non-AI driven productivity tool/technique. If we can't trust the user to determine what is and is not time-saving then time-saving isn't a useful thing to discuss outside of an academic setting.
My issue with most AI discussions is they seem to completely change the dimensions we use to evaluate basic things. I believe if we replaced "AI" with "new useful tool" then people would be much more eager to adopt it.
What clicked for me is when I started treating it more like a tool and less like some sort of nebulous pandora's box.
Now to me it's no different than auto completing code, fuzzy finding files, regular expressions, garbage collection, unit testing, UI frameworks, design patterns, etc. It's just a tool. It has weaknesses and it has strengths. Use it for the strengths and account for the weaknesses.
Like any tool it can be destructive in the hands of an inexperienced person or a person who's asking it to do too much. But in the hands of someone who knows what they're doing and knows what they want out of it - it's so freakin' awesome.
Sorry for the digression. All that to say that if someone believes it's a productivity boost for them then I don't think they're being misled.
Except actual studies objectively show efficiency gains, more with junior devs, which makes sense. So no, it's not a "deception", but it is often overstated in popular media.
And anecdotes are useless. If you want to show me improved studies justifying your claim great, but no I don't value random anecdotes. There are countless conflicting anecdotes (including my own).
I find the opposite: the more senior, the more value they offer, as you know how to ask the right questions, how to vary the questions and try different tacks, and also how to spot errors or mistakes.
Cognitive load is something people always leave out. I can fuckin code drunk with these things. Or just increase stamina to push farther than I would writing every single line.
Exactly, 1 step forward, 1 step backward. Avoiding edge cases is something that can't be glossed over, and for that I need to carefully review the code. Since I'm accountable for it, and can't skip this part anyway, I'd rather review my own than some chatbot's.
That’s the thing, isn’t it? The craft of programming in the small is one of being intimate with the details, thinking things through conscientiously. LLMs don’t do that.
I find that it depends very heavily on what you're up to. When I ask it to write Nix code, it'll just flat out forget how the syntax works halfway through. But if I want it to troubleshoot an Emacs config or wield matplotlib, it's downright wizardly, often including the kind of thing that does indicate an intimacy with the details. I get distracted because I'm then asking it:
> I un-did your change which made no sense to me and now everything is broken, why is what you did necessary?
I think we just have to ask ourselves what we want it to be good at, and then be diligent about generating decades worth of high quality training material in that domain. At some point, it'll start getting the details right.
What languages/toolkits are you working with that are less than 10 years old?
Anyhow, it seems to me like it is working. It's just working better for the really old stuff because:
- there has been more time for training data to accumulate
- some of it predates the trend of monetizing data, so there was less hoarding and more sharing
- we used to have mailing lists and now we have discourse/slack. The former makes a better training dataset
It may be that the hard slow way is the only way to get good results. If the modern trends re: products don't have the longevity/community to benefit from it, maybe we should fix that.
I use Chatgpt for coding / API questions pretty frequently. It's bad at writing code with any kind of non-trivial design complexity.
There have been a bunch of times where I've asked it to write me a snippet of code, and it cheerfully gave me back something that doesn't work for one reason or another. Hallucinated methods are common. Then I ask it to check its code, and it'll find the error and give me back code with a different error. I'll repeat the process a few times before it eventually gets back to code that resembles its first attempt. Then I'll give up and write it myself.
As an example of a task that it failed to do: I asked it to write me an example Python function that runs a subprocess, prints its stdout transparently (so that I can use it for running interactive applications), but also records the process's stdout so that I can use it later. I wanted something that used non-blocking I/O methods, so that I didn't have to explicitly poll every N milliseconds or something.
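For reference, one way to meet that requirement without explicit polling is asyncio's subprocess support; this is a sketch of the approach, not the code the model produced, and it echoes output in chunks so interactive programs stay readable:

```python
import asyncio
import sys


async def run_and_capture(*cmd: str) -> tuple[int, bytes]:
    """Run a command, echo its stdout live, and return (exit code, captured stdout)."""
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE
    )
    captured = bytearray()
    while True:
        chunk = await proc.stdout.read(1024)  # waits on the event loop, no busy polling
        if not chunk:
            break
        sys.stdout.buffer.write(chunk)  # echo transparently
        sys.stdout.flush()
        captured.extend(chunk)
    return await proc.wait(), bytes(captured)


if __name__ == "__main__":
    exit_code, output = asyncio.run(run_and_capture("ls", "-l"))
```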
Honestly I find that when GPT starts to lose the plot it's a good time to refactor and then keep on moving. "Break this into separate headers or modules and give me some YAML like markup with function names, return type, etc for each file." Or just use stubs instead of dumping every line of code in.
If it takes almost no cognitive energy, quite a while. Even if it's a little slower than what I can do, I don't care because I didn't have to focus deeply on it and have plenty of energy left to keep on pushing.
I'm constantly having to go back and tell the AI about every mistake it makes and remind it not to reintroduce mistakes that were previously fixed. "no cognitive energy" is definitely not how I would describe that experience.
As my mother used to say, "I love work. I could watch it all day!"
I can see where you are coming from.
Maintaining a better creative + technical balance, instead of see-sawing. More continuous conscious planning, less drilling.
Plus the unwavering tireless help of these AI's seems psychologically conducive to maintaining one's own motivation. Even if I end up designing an elaborate garden estate or a simpler better six-axis camera stabilizer/tracker, or refactoring how I think of primes before attempting a theorem, ... when that was not my agenda for the day. Or any day.
Why aren't you writing unit tests just because AI wrote the function? Unit tests should be written regardless of the skill of the developer. Ironically, unit tests are also one area where AI really does help move faster.
High level design, rough outlines and approaches, is the worst place to use AI. The other place AI is pretty good is surfacing api call or function calls you might not know about if you're new to the language. Basically, it can save you a lot of time by avoiding the need for tons of internet searching in some cases.
The fact that you think "change detection" tests offer zero value speaks volumes. Those may well be the most important use of unit tests. Getting the function correct in the first place isn't that hard for a senior developer, which is often why it's tempting to skip unit tests. But then you go refactor something and oops you broke it without realizing it, some boring obvious edge case, or the like.
These tests are also very time consuming to write, with lots of boilerplate that AI is very good at writing.
> With minimal guidance[, LLM-based systems] put out pretty sensible tests.
Yes and no. They get out all the initial annoying boilerplate of writing tests out of the way, and the tests end up being mostly decent on the surface, but I have to manually tweak the behavior and write most of the important parts myself, especially for non-trivial tricky scenarios.
However, I am not saying this as a point against LLMs. The fact that they are able to get a good chunk of the boring boilerplate parts of writing unit tests out of the way and let me focus on the actual logic of individual tests has been noticeably helpful to me, personally.
I only use LLMs for the very first initial phase of writing unit tests, with most of the work still being done by me. But that initial phase is the most annoying and boring part of the process for me. So even if I still spend 90% of the time writing code manually, I still am very glad for being able to get that initial boring part out of the way quickly, without wasting my mental effort cycles on it.
If it wants to complete what I wanted to type anyway, or something extremely similar, I just press tab, otherwise I type my own code.
I'd say about 70% of individual lines are obvious enough if you have the surrounding context that this works pretty well in practice. This number is somewhat lower in normal code and higher in unit tests.
Another use case is writing one-off scripts that aren't connected to any codebase in particular. If you're doing a lot of work with data, this comes in very handy.
Something like "here's the header of a CSV file", pass each row through model x, only pass these three fields, the model will give you annotations, put these back in the csv and save, show progress, save every n rows in case of crashes, when the output file exists, skip already processed rows."
I'm not (yet) convinced by AI writing entire features, I tried that a few times and it was very inconsistent with the surrounding codebase. Managing which parts of the codebase to put in its context is definitely an art though.
It's worth keeping in mind that this is the worst AI we'll ever have, so this will probably get better soon.
I don't close my eyes and do whatever it tells me to do. If I think I know better I don't "turn right at the next set of lights" I just drive on as I would have before GPS and eventually realise that I went the wrong way or the satnav realises there was a perfectly valid 2nd/3rd/4th path to get to where I wanted to go.
I find chatgpt incredibly useful for writing scripts against well-known APIs, or for a "better stackoverflow". Things like "how do I use a cursor in sql" or "in a devops yaml pipeline, I want to trigger another pipeline. How do I do that?".
But working on our actual codebase with copilot in the IDE (Rider, in my case) is a net negative. It usually does OK when it's suggesting the completion of a single line, but when it decides to generate a whole block it invariably misunderstands the point of the code. I could imagine that getting better if I wrote more descriptive method names or comments, but the killer for me is that it just makes up methods and method signatures, even for objects that are part of publicly documented frameworks/APIs.
I love your framing of it as a "better stackoverflow." That's so true. However, I feel like some of our complaints about accuracy and hidden bugs are temporary pain (12-36 months) before the tools truly become mind-blowing productivity multipliers.
Same here. If you need to lookup how to do something in an api I find it much faster to use chatgpt than to try to search through the janky official docs or in some github examples folder. Chatgpt is basically documentation search 2.0.
I haven't used Cursor, but I use Aider with Sonnet 3.5 and also use Copilot for "autocomplete".
I'd highly recommend reading through Aider's docs[0], because I think they're relevant for any AI tool you use. A lot of people harp on prompting, and while a good prompt is important, I often see developers making other mistakes, like providing context that isn't good or correct, or providing too much of it[1].
When I find models are going on the wrong path with something, or "connecting the pipes wrong", I often add code comments that provide additional clarity. Not only does this help future me/devs, but the more I steer AI towards correct results, the fewer problems models seem to have going forward.
Everybody seems to be having wildly different experiences using AI for coding assistance, but I've personally found it to be a big productivity boost.
Totally agree that heavy commenting is the best convention for helping the assistant help you best. I try to comment in a way that makes a file or function into a "story" or kind of a single narrative.
That's super interesting, I've been removing a lot of the redundant comments from the AI results. But adding new more explanatory ones that make it easier for both AI and humans to understand the code base makes a lot of sense in my head.
I was big on writing code to be easy to read for humans, but it being easy to read for AI hasn't been a large concern of mine.
> SpaceX's advancements are impressive, from rocket blow up to successfully catching the Starship booster.
That felt like it was LLM-generated, since it doesn't have anything to do with the subject being discussed. Not only is it in a different industry, but it's a completely different set of problems. We know what's involved in catching a rocket. It's a massive engineering challenge, yes, but we all know it can be done (whether or not it makes sense or is economically viable are different issues).
Even going to the Moon – which was a massive project and took massive focus from an entire country to do – was a matter of developing the equipment, procedures, calculations (and yes, some software). We knew back then it could be done, and roughly how.
Artificial intelligence? We don't know enough about "intelligence". There isn't even a target to reach right now. If we said "resources aren't a problem, let's build AI", there isn't a single person on this planet that can tell you how to build such an AI or even which technologies need to be developed.
More to the point, current LLMs are able to probabilistically generate data based on prompts. That's pretty much it. They don't "know" anything about what they are generating, they can't reason about it. In order for "AI" to replace developers entirely, we need other big advancements in the field, which may or may not come.
> Artificial intelligence? We don't know enough about "intelligence".
The problem I have with this objection is that it, like many discussions, conflates LLMs (glorified predictive text) and other technologies currently being referred to as AI, with AGI.
Most of these technologies should still be called machine learning as they aren't really doing anything intelligent in the sense of general intelligence. As you say yourself: they don't know anything. And by inference, they aren't reasoning about anything.
Boilerplate code for common problems, and some not so common ones, which is what LLMs are getting pretty OK at and might in the coming years be very good at, is a definable problem that we understand quite well. And much as we like to think of ourselves as "computer scientists", the vast majority of what we do boils down to boilerplate code using common primitives, that are remarkably similar across many problem domains that might on first look appear to be quite different, because many of the same primitives and compound structures are used. The bits that require actual intelligence are often quite small (this is how I survive as a dev!), or are away from the development coalface (for instance: discovering and defining the problems before we can solve them, or describing the problem & solution such that someone or an "AI" can do the legwork).
> we need other big advancements in the field, which may or may not come.
I'm waiting for an LLM being guided to create a better LLM, and eventually down that chain a real AGI popping into existence, much like the infinite improbability drive being created by clever use of a late version finite improbability generator. This is (hopefully) many years (in fact I'm hoping for at least a couple of decades so I can be safely retired or nearly there!) from happening, but it feels like such things are just over the next deep valley of disillusionment.
Except Cursor is the fireworks based on black powder here. It will look good, but as a technology to get you to the Moon it looks like a dead end. NOTHING (of serious science) seems to indicate LLMs being anything but a dead end with the current hardware capabilities.
So then I ask: What, in qualitative terms, makes you think AI in the current form will be capable of this in 5 or 10 years? Other than seeing the middle of what seems to be an S-curve and going «ooooh shiny exponential!»
> NOTHING (of serious science) seems to indicate LLMs being anything but a dead end with the current hardware capabilities.
In the same sense that black powder sucks as a rocket propellant - but it's enough to demonstrate that iterating on the same architecture and using better fuels will get you to the Moon eventually. LLMs of today are starting points, and many ideas for architectural improvements are being explored, and nothing in serious science suggests that will be a dead end any time soon.
If you look at LLM performance on benchmarks, they keep getting better at a fast rate.[1]
We also now have models of various sizes trained in general matters, and those can now be tuned or fine-tuned to specific domains. The advances in multi-modal AI are also happening very quickly as well. Model specialization, model reflection (chain of thought, OpenAI's new O1 model, etc.) are also undergoing rapid experimentation.
Two demonstrable things that LLMs don't do well currently, are (1) generalize quickly to out-of-distribution examples, (2) catch logic mistakes in questions that look very similar to training data, but are modified. This video talks about both of these things.[2]
I think I-JEPA is a pretty interesting line of work towards solving these problems. I also think that multi-modal AI pushes in a similar direction. We need AI to learn abstractions that are more decoupled from the source format, and we need AI that can reflect and modify its plans and update itself in real time.
All these lines of research and development are more-or-less underway. I think 5-10 years is reasonable for another big advancement in AI capability. We've shown that applying data at scale to simple models works, and now we can experiment with other representations of that data (ie other models or ways to combine LLM inferences).
Tab completion is just one of their proprietary models. I find chat mode more helpful for refactoring and multi-file updates, even more so when I specify the exact files to include.
I'd love an autoselected LLM that is fine-tuned to the syntax I'm actively using -- Cursor has a bit of a head start, but where Github and others can take it could be mindblowing (Cursor's moat is a decent VS Code extension -- I'm not sure it's a deep moat though).
With React, the guesses it makes around what props/types I want where, especially moving from file to file, is worth the price of admission. Everything else it does is icing on the cake. The new import suggestion is much quicker than the TypeScript compiler lmao. And it's always the right one, instead of suggesting ones that aren't relevant.
Composer can be hit or miss, but I've found it really good at game programming.
One of the reasons for that may be the price: large code changes with multi-turn conversations can eat up a lot of tokens, while those tools charge you a flat price per month. Probably many hacks are done under the hood to keep *their* costs low, and the user experiences this as lower-quality responses.
Still, "architecture and core libraries" is rather a corner case, something at the bottom of their current sales funnel.
Also: do you really want to get the equivalent of 1 FTE's work for 20 USD per month? :)
> How can this be possible if you literally admit its tab completion is mindblowing?
I might suggest that coding doesn't take as much of our time as we might think it does.
Hypothetically:
Suppose coding takes 20% of your total clock time. If you improve your coding efficiency by 10%, you've only improved your total job efficiency by 2%. This is great, but probably not the mind-blowing gain that's hyped by the AI boom.
(I used 20% as a sample here, but it's not far away from my anecdotal experience, where so much of my time is spent in spec gathering, communication, meeting security/compliance standards, etc).
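To make that arithmetic concrete, here's a minimal sketch (the 20% coding share and 10% speedup are just the hypothetical numbers from above, not measurements):

```python
# Hypothetical numbers from the comment above: coding is 20% of total clock time,
# and an AI assistant makes the coding part 10% faster.
coding_share = 0.20     # fraction of the work week spent actually writing code
coding_speedup = 1.10   # the coding part gets 10% faster

# Amdahl's-law-style overall gain: only the coding fraction benefits.
new_time = (1 - coding_share) + coding_share / coding_speedup
overall_gain = 1 / new_time - 1

print(f"Overall job efficiency gain: {overall_gain:.1%}")  # ~1.8%, i.e. roughly the 2% cited
```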
> How can this be possible if you literally admit its tab completion is mindblowing?
What about it makes it impossible? I’m impressed by what AI assistants can do - and in practice it doesn’t help me personally.
> Select line of code, prompt it to refactor, verify they are good, accept the changes.
It’s the “verify” part that I find tricky. Do it too fast and you spend more time debugging than you originally gained. Do it too slow and you don’t gain much time.
There is a whole category of bugs that I’m unlikely to write myself but I’m likely to overlook when reading code. Mixing up variable types, mixing up variables with similar names, misusing functions I’m unfamiliar with and more.
I think the essential point around impressive vs helpful sums up so much of the discourse around this stuff. It's all just where you fall on the line between "impressive is necessarily good" and "no it isn't".
I rarely use the tab completion. Instead I use the chat and manually select files I know should be in context. I am barely writing any code myself anymore.
Just sanity checking that the output and "piping" are correct.
My productivity (in frontend work at least) is significantly higher than before.
Out of curiosity, how long have you been working as a developer? Just that, in my experience, this is mostly true for juniors and mids (depending on the company, language, product etc. etc.). For example, I often find that copilot will hallucinate tailwind classes that don't exist in our design system library, or make simple logical errors when building charts (sometimes incorrect ranges, rarely hallucinated fields) and as soon as I start bringing in 3rd party services or poorly named legacy APIs all hope is lost and I'm better off going it alone with an LSP and a prayer.
AI will have the effect of shifting development effort from authorship to verification. As you note, we've come a long way towards making the writing of the code practically free, but we're going to need to beef up our tools for understanding code already written. I think we've only scratched the surface of AI-assisted program analysis.
Honestly, I hate it. I find myself banging my head on the table because of how many endless cycles it goes through without giving me the solution haha. I've probably started projects over multiple times because the code it generated was so bad lol.
That's my exact experience with GitHub Copilot. Even boilerplate stuff it sucks at as well. I have no idea why its autocomplete is so bad when it has access to my code, the function signatures, types, etc. It gets stuff wrong all the time. For example, it will just flat out suggest functions that don't exist, neither in the Python core libraries nor in my own modules. It doesn't make sense.
I have all but given up on using Copilot for code development. I still do use it for autocomplete and boilerplate stuff, but I still have to review that. So there's still quite a bit of overhead, as it introduces subtle errors, especially in languages like Python. Beyond that, its failure rate at producing running, correct code is basically 100%.
I'm building a tool in this space and believe it's actually multiple separate problems. From most to least solvable:
1. AI coding tools benefit a lot from explicit instructions/specifications and context for how their output will be used. This is actually a very similar problem to when eg someone asks a programmer "build me a website to do X" and then being unhappy with the result because they actually wanted to do "something like X", and a payments portal, and yellow buttons, and to host it on their existing website. So models need to be given those particular instructions somehow (there are many ways to do it, I think my approach is one of the best so far) and context (eg RAG via find-references, other files in your codebase, etc)
2. AI makes coding errors, bad assumptions, and mistakes just like humans. It's rather difficult to implement auto-correction in a good way, and goes beyond mere code-writing into "agentic" territory. This is also what I'm working on.
3. AI tools don't have architecture/software/system design knowledge appropriately represented in their training data or in the other techniques used to refine the model before release. More accurately, they might have knowledge in the form of, e.g., all the blog posts and docs out there about it, but not skill. Actually, there is some improvement here, because I think o1 and 3.5 Sonnet are doing some kind of reinforcement learning/self-training to get better at this. But it's not easily addressable on your end.
4. There is ultimately a ton of context cached in your brain that you cannot realistically share with the AI model, either because it's not written anywhere or there is just too much of it. For example, you may want to structure your code in a certain way because your next feature will extend it or use it. Or your product is hosted on serving platform Y which has an implementation detail where it tries automatically setting Content-Type response headers by appending them to existing headers, so manually setting Content-Type in the response causes bugs on certain clients. You can't magically stuff all of this into the model context.
My product tries to address all of these to varying extents. The largest gains in coding come from making it easier to specify requirements and self-correct, but architecture/design are much harder and not something we're working on much. You or anybody else can feel free to email me if you're interested in meeting for a product demo/feedback session - so far people really like our approach to setting output specs.
Every single one of these discussions, at some point, devolves to some version of
- <LLM Y> is by far the best. In my extensive usage it consistently outperforms <LLM X> by at least 2x. The difference is night and day.
Then the immediate child reply:
- What!? You must be holding it wrong. The complete inverse is true for me.
I don't know what to make of this contradiction. We're all using the same 2 things, right? How can opinions vary by such a large amount? It makes me not trust any opinion on any other subject (which admittedly is not a bad default state, but who has time to form their own opinions on everything).
People are learning to prompt LLMs in ways that produce better results for their LLM of choice, so switching to another one they find their approach no longer works as well.
Or.. LLMs have different personalities in terms of output, some being more or less direct/polite than others, or sounding more or less confident, and that is causing people to perceive a difference that, in terms of factual answers, may not really be there.
Or just personal preference masquerading as intelligence - a classic among software engineers.
Reviewing these conversations is like listening to horse and buggy manufacturers pooh-poohing automobiles:
1. they will scare the horses; a funky 'automobile' is no match for a good team of horses
2. how will they be able to deal with our muddy, messy roads
3. their engines are unreliable and prone to breaking down, stranding you in the middle of nowhere and leaving you to fix it yourself..
4. their drivers can't handle the speed; too many miles driven means unsafe driving.. we should stick to horses, they are manageable.
Meanwhile I'm watching a community of mostly young people building and using tools like copilot, cursor, replit, jacob etc and wiring up LLMs into increasingly more complex workflows.
this is a snapshot of the current state, not a reflection of the future. Give it 10 years.
It is hard to understand a phenomenon if it stands to reduce your income. Even if LLMs don't improve one bit from here and the current state is frozen, they are still too good to ignore and will be everywhere before we can finish talking of horses and automobiles.
LLMs make my job as a software engineer even more secure. Most of what I do is social and/or understanding what is going on. LLMs are a tool to reduce mental load on some tasks when I'm in VSCode. They are like a pilot's autopilot.
If an LLM takes my job, then we have reached the singularity. Jobs won't matter anymore at that point.
Except the automobile in this case only reaches the destination correctly sometimes. They are less likely to reach the destination as the path becomes longer or more complex.
I'm not sure why people can't be humble enough to accept that we don't really know what the future will hold. Just because people have underestimated some new technology in the past doesn't mean that will continue to be true for all new technologies.
The fact that LLMs currently do not really understand the answers they're giving you is a pretty significant limitation they have. It doesn't make them useless, but it means they're not as useful at a lot of tasks that people think they can handle. And that limitation is also fundamental to how LLMs work. Can that be overcome? Maybe. There's certainly a ton of money behind it and a lot of smart people are working on it. But is it guaranteed?
Perhaps I'm wrong and we already know that it's simply a matter of time. I'd love to read a technical explanation for why that is, but I mostly see people rolling their eyes at us mere mortals who don't see how this will obviously change everything, as if we're too small-minded to understand what's going on.
To be extra clear, I'm not saying LLMs won't be a technological innovation as seismic as the invention of the car. My confusion is why for some there doesn't seem to be room for doubt.
Prospective and retrospective analysis are fundamentally different. It’s easy to point to successes and failures of the past, but that’s not how we predict the concrete future potential of one specific thing.
I don't see a young/old divide when it comes to AI. Although there is a young/old divide in familial responsibilities and willingness to be a chip on the VC's roulette table.
Absolutely true. The oldest devs I work with are some of the most enthusiastic about using LLM chat to develop. Among the younger devs, they all seem to use it but the amount that can actually produce working code are few.
Now I get a lot of calls from the team asking for help fixing some code they got from an AI. Overall it is improving the code quality from the group; I no longer have to instruct people on the basics of setting up their approach/solution. I will admit there is a little difficulty dealing with pushback on my guidance, e.g. "well, ChatGPT said I should use this library" when the core SDK already supports something more recent than what the AI was trained on.
I think a lot of the criticism is constructive. Many of the limitations won’t just magically go away - we’ll have to build tooling and processes and adjust our way of thinking to get there. Most devs will jump across to anything useful the second it’s ready, I would think
This is pretty exciting. I'm a copilot user at work, but also have access to Claude. I'm more inclined to use Claude for difficult coding problems or to review my work as I've just grown more confident in its abilities over the last several months.
I use both Claude and ChatGPT/GPT-4o a lot. Claude, the model, definitely is 'better' than GPT-4o. But OpenAI provides a much more capable app in ChatGPT and an easier development platform.
I would absolutely choose to use Claude as my model with ChatGPT if that happened (yes, I know it won't). ChatGPT as an app is just so far ahead: code interpreter, web search/fetch, fluid voice interaction, Custom GPTs, image generation, and memory. It isn't close. But Claude absolutely produces better code, only being beaten by ChatGPT because it can fetch data from the web to RAG enhance its knowledge of things like APIs.
Claude's implementation of artifacts is very good though, and I'm sure that is what led OpenAI to push out their buggy canvas feature.
It’s all a dice game with these things, you have to watch them closely or they start running you (with bad outcomes). Disclaimers aside:
Sonnet is better in the small, by a lot. It’s sharply up from idk, three months ago or something when it was still an attractive nuisance. It still tops out at “Best SO Answer”, but it hits that like 90%+. If it involves more than copy paste, sorry folks, it’s still just really fucking good copy paste.
But for sheer “doesn’t stutter every interaction at the worst moment”? You’ve got to hand it to the ops people: 4o can give you second best in industrial quantity on demand. I’m finding that if AI is good enough, then OpenAI is good enough.
>If it involves more than copy paste, sorry folks, it’s still just really fucking good copy paste.
Are you sure you're using Claude 3.5 Sonnet? In my experience it's absolutely capable of writing entire small applications based off a detailed spec I give it, which don't exist on GitHub or Stack Overflow. It makes some mistakes, especially for underspecified things, but generally it can fix them with further prompting.
Are there any good 3rd-party native frontend apps for Claude (on macOS)? I mean something like ChatGPT's app, not an editor. I guess one option would be to just run the Claude iPad app on macOS.
Jan [0] is MacOS native, open source, similar feel to the ChatGPT frontend, very polished, and offers Anthropic integration (all Claude models).
It also features one-click installation, OpenAI integration, a hub for downloading and running local models, a spec-compatible API server, global "quick answer" shortcut, and more. Really can't recommend it enough!
If you're willing to settle for a client-side only web frontend (i.e. talks directly with APIs of the models you use), TypingMind would work. It's paid, but it's good (see [0]), and I guess you could always go for the self-hosted version and wrap it in an Electron app - it's what most "native" apps are these days anyway (and LLM frontends in particular).
Open-WebUI doesn't support Claude natively (only through a series of hacks) but it is absolutely "THE" go-to for a ChatGPT Pro like experience (it is slightly better).
FWIW, I was able to get a decent way into making my own client for ChatGPT by asking the free 3.5 version to do JS for me* before it was made redundant by the real app, so this shouldn't be too hard if you want a specific experience/workflow?
* I'm iOS by experience; my main professional JS experience was something like a year before jQuery came out, so I kinda need an LLM to catch me up for anything HTML
> ChatGPT as an app is just so far ahead: code interpreter, web search/fetch, fluid voice interaction, Custom GPTs, image generation, and memory. It isn't close.
Funny thing, TypingMind was ahead of them for over a year, implementing those features on top of the API, without trying to mix business model with engineering[0]. It's only recently that ChatGPT webapp got more polished and streamlined, but TypingMind's been giving you all those features for every LLM that can handle it. So, if you're looking for ChatGPT-level frontend to Anthropic models, this is it.
ChatGPT shines on mobile[1] and I still keep my subscription for that reason. On desktop, I stick to TypingMind and being able to run the same plugins on GPT-4o and Claude 3.5 Sonnet, and if I need a new tool, I can make myself one in five minutes with passing knowledge of JavaScript[2]; no need to subscribe to some Gee Pee Tee.
Now, I know I sound like a shill, I'm not. I'm just a satisfied user with no affiliation to the app or the guy that made it. It's just that TypingMind did the blindingly obvious thing to do with the API and tool support (even before the latter was released), and continues to do the obvious things with it, and I'm completely confused as to why others don't, or why people find "GPTs" novel. They're not. They're a simple idea, wrapped in tons of marketing bullshit that makes it less useful and delayed its release by half a year.
--
[0] - "GPTs", seriously. That's not a feature, that's just system prompt and model config, put in an opaque box and distributed on a marketplace for no good reason.
[1] - Voice story has been better for a while, but that's a matter of integration - OpenAI putting together their own LLM and (unreleased) voice model in a mobile app, in a manner hardly possible with the API their offered, vs. TypingMind being a webapp that uses third party TTS and STT models via "bring your own API key" approach.
Have you tried using Cursor with Claude embedded? I can't go back to anything else; it's very nice having the AI embedded in the IDE, and it just knows all the files I am working with. Cursor can use GPT-4o too if you want.
I too use Claude more frequently than OpenAI's GPT-4o. I think this is a two-fold move for MS and I like it. Claude being more accurate/efficient for me says it's likely they see the same thing; win number 1. The second is that, with all the OpenAI drama, MS has started to distance themselves from a souring relationship (allegedly). If so, this could be a smart, tactful move away.
Either way, Claude is great so this is a net win for everyone.
I'm the same, but had a lot of issues getting structured output from Anthropic. Ended up always writing response processors. Frustrated by how fragile that was, decided to try OpenAI structured outputs and it just worked and since they also have prompt caching now, it worked out very well for my use case.
Anthropic seems to have addressed the issue using Pydantic, but I haven't had a chance to test it yet.
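For what it's worth, the OpenAI side of that workflow looks roughly like this with a recent openai Python SDK; the model name and the Ticket schema are just illustrative, not anything from the thread:

```python
# Rough sketch of OpenAI structured outputs backed by a Pydantic schema.
# Model name and fields are illustrative.
from pydantic import BaseModel
from openai import OpenAI

class Ticket(BaseModel):
    title: str
    priority: int
    tags: list[str]

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Turn this bug report into a ticket: ..."}],
    response_format=Ticket,  # the SDK constrains and parses the reply into the schema
)
ticket = completion.choices[0].message.parsed  # a validated Ticket instance, no manual response processing
```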
> The second is that, with all the OpenAI drama, MS has started to distance themselves from a souring relationship (allegedly). If so, this could be a smart, tactful move away.
I agree, this was a tactical move designed to give them leverage over OpenAI.
I recently tried to ask these tools for help with using a popular library, and both GPT-4o and Claude 3.5 Sonnet gave highly misleading and unusable suggestions. They consistently hallucinated APIs that didn't exist, and would repeat the same wrong answers, ignoring my previous instructions. I spent upwards of 30 minutes repeating "now I get this error" to try to coax them in the right direction, but always ending up in a loop that got me nowhere. Some of the errors were really basic too, like referencing a variable that was never declared, etc. Finally, Claude made a tangential suggestion that made me look into using a different approach, but it was still faster to look into the official documentation than to keep asking it questions. GPT-4o was noticeably worse, and I quickly abandoned it.
If this is the state of the art of coding LLMs, I really don't see why I should waste my time evaluating their confident sounding, but wrong, answers. It doesn't seem like much has improved in the past year or so, and at this point this seems like an inherent limitation of the architecture.
To be clear, I didn't ask it to write something complex. The prompt was "how do I do X with library Y?", with a bit more detail. The library is fairly popular and in a mainstream language.
I had a suspicion that what I was trying to do was simply not possible with that library, but since LLMs are incapable of saying "that's not possible" or "I don't know", they will rephrase your prompt and hallucinate whatever might plausibly make sense. They have no way to gauge whether what they're outputting is actually correct.
So I can imagine that you sometimes might get something useful from this, but if you want a specific answer about something, you will always have to double-check their work. In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code. I think there are coding assistant services that do this already, but frankly, I was expecting more from simple chat services.
Specific is the specific thing that statistical models are not good at :(
> how do I do X with library Y?
Recent research and anecdotal experience have shown that LLMs perform quite poorly with short prompts. Attention just has more data to work with when there are more tokens. Try extending that question along the lines of: "I am using this programming language and am trying to accomplish this task with this library. How do I do this specific thing with this other library?"
I realize prompt engineering like this is fuzzy and “magic,” but short prompts have a consistent lower performance.
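As a rough illustration of the difference (the library and task here are made up, and the model name is just an example):

```python
# Illustration only: the same question asked tersely vs. with the surrounding context spelled out.
from openai import OpenAI

short_prompt = "How do I do X with library Y?"

detailed_prompt = """I'm writing a Python 3.12 service that already uses library Y (version 2.x)
for background jobs. I need to schedule a task to run once, five minutes from now, and cancel it
if the user logs out first. Show how to do this with library Y only, and tell me explicitly
if Y simply can't do it rather than suggesting another library."""

client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": detailed_prompt}],  # the longer prompt gives attention more to work with
)
print(reply.choices[0].message.content)
```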
> In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code.
Not as simple as you’d think. You’re letting something run arbitrary code.
Tho you should give aider.chat a try if you want to test out that workflow. I found it very very slow.
Well, it is a volume business. The <1% of advanced-skill developers will find an AI helper useless, but for the 99% of IT CRUD peddlers these tools are quite sufficient. All in all, if employers can cut 15-20% of net development costs by reducing head count, it will be very worthwhile for companies.
Agree. But we are already in that loop. A properly written 50 KLOC "monolith, hence outdated" app is now 30 microservices with a 20 KLOC surface plus 100 KLOC submerged in convenience libraries, along with Kubernetes, Grafana, Datadog, a service mesh and so on. From what I am seeing, companies are increasingly using off-the-shelf components, so KLOC will keep rising but developer count won't.
It's worse than that. Now the balls of mud are distributed. We get incredibly complex interactions between services which need a lot of infrastructure to enable them, that requires more observability, which requires more infrastructure...
Yeah. You can fit a lot of business logic into a 100kloc monolith written by skilled developers.
Once you start shifting it to micro services the business logic gets spread out and duplicated.
At the same time, each microservice now has its own code to handle REST, GraphQL, and gRPC endpoints.
And each downstream call needs error handling and retry logic.
And of course now you need distributed tracing.
And of course now your auth becomes much more complex.
And of course now each service might be called multiple times for the one request - better make them idempotent.
And each service will drift in terms of underlying libraries.
And so on.
Now we have been adding in LLM solutions so there is no consistency in any of the above services.
Each dev rather than look at the existing approaches instead asks Claude and it provides a slightly different way each time - often pulling in additional libraries we have to support.
These days I see so much bad code like a single microservice with 3 different approaches to making a http request.
Sure, but my specific question was fairly trivial, using a mainstream language and a popular library. Most of my work qualifies as CRUD peddling. And yet these tools are still wasting my time.
Maybe I'll have better luck next time, or maybe I need to improve my prompting skills, or use a different model, etc. I was just expecting more from state of the art LLMs in 2024.
Yeah there is a big disconnect between the devs caught up in the hype and the devs who aren't.
A lot of the devs in my office using Claude/gpt are convinced they are so much more productive but they aren't actually producing features or bug fixes any faster.
I think they are just excited about a novel new way to write code.
Conversely, I feel that the experience of searching has degraded a lot since 2016/17. My thesis is that, around that time, online spam increased by an order of magnitude.
Old style Google search is dead, folks just haven’t closed the casket yet. My index queries are down ~90%. In the future, we’ll look back at LLMs as a major turning point in how people retrieve and consume information.
I think it was the switch from desktop search traffic being dominant to mobile traffic being dominant, that switch happened around the end of 2016.
Google used to prioritise big comprehensive articles on subjects for desktop users but mobile users just wanted quick answers, so that's what google prioritised as they became the biggest users.
But also, per your point, I think those smaller, simpler, less comprehensive posts are easier to fake/spam than the larger, more comprehensive posts that came before.
It's getting ridiculous. Half of the time now when I ask AI to search some information for me, it finds and summarizes some very long article obviously written by AI, and lacking any useful information.
The speed with which AI models are improving blows my mind. Humans quickly normalize technological progress, but it's staggering to reflect on our progress over just these two years.
Yes! I'm much more inclined to write one-off scripts for short manual tasks as I can usually get AI to get something useful very fast. For example, last week I worked with Claude to write a script to get a sense of how many PRs my company had that included comprehensive testing. This was borderline best done as a manual task previously, now I just ask Claude to write a short bash script that uses the GitHub CLI to do it and I've got a repeatable reliable process for pulling this information.
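Not the commenter's actual script, but a minimal sketch of the same idea, assuming the GitHub CLI (`gh`) is installed and authenticated; the "touched a test file" heuristic and the exact JSON fields are my own assumptions:

```python
# Count how many recently merged PRs touched a test file, via the GitHub CLI.
# Assumes `gh` is installed and authenticated; the heuristic is deliberately crude.
import json
import subprocess

def gh_json(args):
    """Run a gh command that emits JSON and return the parsed result."""
    out = subprocess.run(["gh", *args], capture_output=True, text=True, check=True).stdout
    return json.loads(out)

prs = gh_json(["pr", "list", "--state", "merged", "--limit", "100", "--json", "number,title"])

with_tests = 0
for pr in prs:
    files = gh_json(["pr", "view", str(pr["number"]), "--json", "files"])["files"]
    if any("test" in f["path"].lower() for f in files):
        with_tests += 1

print(f"{with_tests}/{len(prs)} recently merged PRs touched a test file")
```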
I rarely use LLMs for tasks, but I love them for exploring spaces I would otherwise just ignore. Writing some random bash script isn't difficult at all, but it's also so fiddly that I just don't care to do it. It's nice to just throw a bot at it and come back later. Loosely speaking.
Still, I find very little use for LLMs on this front, but they do come in handy randomly.
Lots of progress, but I feel like we've been seeing diminishing returns. I can't help but feel like recent improvements are just refinements and not real advances. The interest in AI may drive investment and research in better models that are game-changers, but we aren't there yet.
You're proving GP's point about normalization of progress. It's been two years. We're still during the first iteration of applications of this new tech, advancements didn't have time yet to start compounding. This is barely getting started.
I don't know about you, but o1-preview/o1-mini has been able to solve many moderately challenging programming tasks that would've taken me 30 mins to an hour. No other models earlier could've done that.
It's an improvement but...I've asked it to do some really simple tasks and it'll occasionally do them in the most roundabout way you could imagine. Like, let's source a bash file that creates and reads a state file to do something for which the functionality was already built-in. Say I'm a little skeptical of this solution and plug it into a new o1-preview prompt to double check the solution, and it starts by critiquing the bash script and error handling instead of seeing that the functionality is baked in and it's plainly documented. Other errors have been more subtle.
When it works, it's pretty good, and sometimes great. But when failure modes look like the above I'm very wary of accepting its output.
I wonder how long people will still protest in these threads that "It doesn't know anything! It's just an autocomplete parrot!"
Because.. yea, it is. However.. it keeps expanding, it keeps getting more useful. Yea people and especially companies are using it for things which it has no business being involved in.. and despite that it keeps growing, it keeps progressing.
I do find the "stochastic parrot" comments slowly dwindle in number and volume with each significant release, though.
Still, I find it weirdly interesting to see a bunch of people be both right and "wrong" at the same time. They're completely right, and yet it's like they're also being proven wrong in the ways that matter.
There's the question, "is an LLM just autocomplete?" The answer to that question is obviously no, but the question is also a strawman - people who actually use LLMs regularly do recognize that there is more to their capabilities than randomized pattern matching.
Separately, there's the question of "will LLMs become AGI and/or become superintelligent?" Most people recognize that LLMs are not currently superintelligent, and that there currently isn't a clear path toward making them so. Still, many people seem to feel that we're on the verge of progress here, and feel very strongly that anyone who disagrees is an AI "doomer".
Then there's the question of "are we in an AI bubble?" This is more a matter of debate. Some would argue that if LLM reasoning capabilities plateau, people will stop investing in the technology. I actually don't agree with that view - I think there is a lot of economic value still to be realized in AI advancements - and I don't think we're on the verge of some sort of AI winter, even if LLMs never become superintelligent.
When the temperature is 0.5, both Claude 3.5 and GPT-4o can't properly recognize that GitHub is capitalized. You can see the responses by clicking in the sentence. Each model was asked to validate the sentence 5 times.
If the temperature is set to 0.0, most models will get it right (most of the time), but Claude 3.5 still can't see the sentence in front of it.
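(The linked demo isn't reproduced here, but rerunning the check yourself is only a few lines against both APIs; the model identifiers below are assumptions, so substitute whichever versions you have access to.)

```python
# Minimal sketch: ask both models the capitalization question at temperature 0.
# Model identifiers are assumptions; adjust to whatever you actually have access to.
from anthropic import Anthropic
from openai import OpenAI

question = 'In the sentence "I pushed the fix to GitHub last night.", is the word "GitHub" capitalized?'

claude = Anthropic().messages.create(
    model="claude-3-5-sonnet-latest",
    max_tokens=100,
    temperature=0,
    messages=[{"role": "user", "content": question}],
)
print("Claude:", claude.content[0].text)

gpt = OpenAI().chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{"role": "user", "content": question}],
)
print("GPT-4o:", gpt.choices[0].message.content)
```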
> I think calling it intelligent is being extremely generous ... can't properly recognize that GitHub is capitalized.
Wouldn't this make chimpanzees and ravens and dolphins unintelligent too? You're asking it to do a task that's (mostly) easy for humans. It's not a human though. It's an alien intelligence which "thinks" in our language, but not in the same way we do.
If they could, specialized AI might think we're unintelligent based on how often we fail, even with advanced tools, pattern matching tasks that are trivial for them. Would you say they're right to feel that way?
Animals are capable of learning. LLMs are not. An LLM uses weights that are defined during the training process to decide what to do next. An LLM cannot self-evaluate based on what it has said; you have to create a new message for it to create a new probability path.
Animals have the ability to learn and grow by themselves. LLMs are not intelligent and I don't see how they can be since they just follow the most likely path with randomness (temperature) sprinkled in.
The "statistical parrot" parrots have been demonstrably wrong for years (see e.g. LeCun et al[1]). It's just harder to ignore reality with hundreds of millions of people now using incredible new AI tools. We're approaching "don't believe your lying eyes" territory. Deniers will continue pretending that LLMs are just an NFT-level fad or bubble or whatever. The AI revolution will continue to pass them by. More's the pity.
> Deniers will continue pretending that LLMs are just an NFT-level fad or bubble or whatever. The AI revolution will continue to pass them by. More's the pity.
You should re-read that very slowly and carefully and really think about it. Calling anyone that's skeptical a 'denier' is a red flag.
We have been through these AI cycles before. In every case, the tools were impressive for their time. Their limitations were always brushed aside and we would get a hype cycle. There was nothing wrong with the technology, but humans always like to try to extrapolate their capabilities and we usually get that wrong. When hype caught up to reality, investments dried up and nobody wanted to touch "AI" for a while.
Rinse, repeat.
LLMs are again impressive, for our time. When the dust settles, we'll get some useful tools but I'm pretty sure we will experience another – severe – AI winter.
If we had some optimistic but also realistic discussions of their limitations, I'd be less skeptical. As it is, we are talking about 'revolution', and developers being out of jobs, and superintelligence and whatnot. That's not the level the technology is at today, and it is not clear we are going to do anything other than get stuck in a local maximum.
I don't know how you can say they lack understanding of the world when they perform better than the average human on pretty much any standardised test designed to measure human intelligence. The only thing they don't understand is touch, because they're not trained on that, but they can already understand audio and video.
You said it, those tests are designed to measure human intelligence, because we know that there is a correspondence between test results and other, more general tasks - in humans. We do not know that such a correspondence exists with language models. I would actually argue that they demonstrably do not, since even an LLM that passes every IQ test you put in front of it can still trip up on trivial exceptions that wouldn't fool a child.
No you don’t understand, if i put a billion billion trillion monkeys on typewriters, they’re actually now one super intelligent monkey because they’re useful now!
We just need more monkeys and it will be the same as a human brain.
Are you confusing frequency of use with usefulness?
If these tools boost productivity, where is the output spike at all the companies, the spike in revenue and profits?
How often do we lose the benefit of auto text generation to the loop of:
That’s wrong
Oh yes of course, here is the correct version
Nope, still wrong
Prompt editing?
One service is not really enough -- you need a few to triangulate more often than not, especially when it comes to code using the latest versions of public APIs.
Phind is useful as you can switch between them -- but you only get a handful of o1 and Opus queries a day, which I burn through quickly at the moment on deeper things -- Phind-405b and 3.5 Sonnet are decent for general use.
I wonder what the rationale for this was internally. More OpenAI issues? competitiveness with Cursor? It seems good for the user to increase competition across LLM providers.
Also, the title is ambiguous. I thought GitHub had canceled deals they had in the works. The article is clearly about making a deal, but that's unclear from the article's title.
Could be a fight against Llama, which excludes MS and Google in its open license (though I think has done separate pay deals with one or both of them). Meta are notably absent from this announcement.
I usually feel like I can confidently express a change I want in code faster and better than I can explain what I want an AI to do in English. If I have a good prompt, these tools work okay, but getting that prompt right is often almost as hard as just writing the code itself. Do others feel the same struggle?
I’m positive my experience pales in comparison to yours, as I don’t actually code anything beyond the occasional single use script, but YES! I hate trying to explain the exact SQL result I’m looking for or some text modification I need to be able to throw together a CTE since I have read-only access and can’t even build a temp table.
I don’t know how people can claim such huge success using Copilot and such. I also own a subscription and tried to use it for coding, but it failed horribly at every task, from Spring Boot authentication configuration to AWS policies and Lambdas.
Writing the code myself using proper documentation was the only option.
I wonder if false information is written here in the comments section for certain reasons …
You need to spend time learning how to use it. This is difficult because there's no manual, and there's a widespread implication that it should just magically work well without you having to invest any effort in it.
If you can figure out HOW to invest that effort it becomes really valuable.
I wish I had good resources I could link you to here but I don't, which is a big part of the problem here.
You sound like you’ve had success using this tech for work. Can you tell more about your personal experience, please? I’ve tried ChatGPT a few times a year ago or so, but it was extremely frustrating, and I gave up.
I’ve been using Cody from Sourcegraph to have access to other models; if Copilot offers something similar I guess I will switch back to it. I find Copilot autocomplete to be more often on point than Cody, but the chat experience with Cody + Sonnet 3.5 is way ahead in my experience.
Context is a huge part of the chat experience in Cody, and we're working hard to stay ahead there as well with things like OpenCtx (https://openctx.org) and more code context based on the code graph (defs/refs/etc.). All this competition is good for everyone. :)
I'm a hypocrite because I'm now currently paying for Cody due to their integration with the new OpenAI o1-preview model. I find this model to be mind blowing and it's made me actually focus on the more mundane tasks that come with the job.
Obviously! However, Google being an American company, that was surprising. I'm in Europe and am used to seeing newest posts "from yesterday" when they are from the USA. This one is weird.
It may not be a model quality issue. It may be that GitHub wants to sell a lot more of Copilot, including to companies who refuse to use anything from OpenAI. Now GitHub can say "Oh that's fine, we have these two other lovely providers to choose from."
Also, after Anthropic and Google sold massive amounts of pre-paid usage credits to companies, those companies want to draw down that usage and get their money's worth. GitHub might allow them to do that through Copilot, and therefore get their business.
I think the credit scenario is more true for OpenAI than others. Existing Azure commits can be used to buy OpenAI via the marketplace. It will never be as simple for any non-Azure partner (only GitHub is tying up with Anthropic here, not Azure).
GitHub doesn’t even support using those Azure-managed APIs for Copilot today; it is just a license you can currently buy and add to a user license. The best you can do is pay for Copilot with existing Azure commits.
This seems to be about not being left behind as other models outpace what Copilot can do with its custom OpenAI model, which doesn’t seem to be getting updated.
Yes, because transfer learning works. A specialized model for X will be subsumed by a general model for X/Y/Z as it becomes better at Y/Z. This is why models which learn other languages become better at English.
Custom models still have use cases, e.g. situations requiring cheaper or faster inference. But ultimately The Bitter Lesson holds -- your specialized thing will always be overtaken by throwing more compute at a general thing. We'll be following around foundation models for the foreseeable future, with distilled offshoots bubbling up/dying along the way.
Evaluating cross-lingual transfer learning approaches in multilingual conversational agent models[1]
Cross-lingual transfer learning for multilingual voice agents[2]
Large Language Models Are Cross-Lingual Knowledge-Free Reasoners[3]
An Empirical Study of Cross-Lingual Transfer Learning in Programming Languages[4]
That should get you started on transfer learning re. languages, but you'll have more fun personally picking interesting papers over reading a random yahoo's choices. The fire hose of papers is nuts, so you'll never be left wanting.
I still think it’s worth emphasising - LLMs represent a massive capital absorber. Taking gobs of funding into your company is how you grow, how your options become more valuable, how your employees stay with you. If that treadmill were to break bad things happen.
Search has been stuttering for a while - Google’s growth and investment have been flattening - at some point they absorbed all the world's stored information.
OpenAI showed the new growth: we need billions of dollars to build and then run the LLMs (at a loss, one assumes) - the treadmill can keep going.
One of the reasons that comes to my mind is that it could have been a problematic look for only Microsoft (Copilot) to have access to GitHub for training AI models - à la monopolizing a data treasure trove. With anti-competitive legislation catching up to Google to open up its Play Store, this could have been one of the key reasons why this deal came about.
Copilot can choke on my AGPL code on GitHub, that was used for training their proprietary models. I'm still salty about this, sadly looks like the world has largely moved on.
The Claude terms of service [1] apparently preclude Anthropic or AWS using GitHub user data for training:
GitHub Copilot uses Claude 3.5 Sonnet hosted on Amazon Web Services. When using Claude 3.5 Sonnet, prompts and metadata are sent to Amazon's Bedrock service, which makes the following data commitments: Amazon Bedrock doesn't store or log your prompts and completions. Amazon Bedrock doesn't use your prompts and completions to train any AWS models and doesn't distribute them to third parties.
It really feels like a digital form of colonialism; they come in take everything, completely disregard the rules, ignore intellectual copyright laws (while you still have to obey them), but when you speak out against this suddenly you are a luddite that doesn't care about human progress.
> Silicon Knights had "deliberately and repeatedly copied thousands of lines of Epic Games' copyrighted code, and then attempted to conceal its wrongdoing by removing Epic Games' copyright notices and by disguising Epic Games' copyrighted code as Silicon Knights' own
> Epic Games prevailed against Silicon Knights' lawsuit, and won its counter-suit for $4.45 million on grounds of copyright infringement,
> following the loss of the court case, Silicon Knights filed for bankruptcy
Is Microsoft Copilot even a single product? It seems to me they just shove AI in random places throughout their products and call it Copilot. Which would make Github Copilot essentially another one of these places the branding shows up (even if it started there)
Because they previously decreased more under the expectation of another half point cut by the fed. Stronger economic indicators have cut the expectation for steep rate cuts so treasuries are declining.
Call me eccentric, but the only true or utilitarian use case I've found for AI so far is ChatGPT. The rest all appear to be shiny toys just trying to bask in the AI glory, but none of them solve any real human problem.
I have no doubts that Claude is serviceable from a coder's perspective. But for me, as a paid user, I became tired of being told that I have to slow down and then being cut off while actively working on a product. When Anthropic addresses this, I'll add it back to my tools.
I mentored junior SWE and CS students for years, and now using Claude as a coding assistant feels very similar. Yesterday, it suggested implementing a JSON parser from scratch in C to avoid a dependency -- and, unsurprisingly, the code didn’t work. Two main differences stand out: 1) the LLM doesn’t learn from corrections (at least not directly), and 2) the feedback loop is seconds instead of days. This speed is so convenient that it makes hiring junior SWEs seem almost pointless, though I sometimes wonder where we’ll find mid-level and senior developers tomorrow if we stop hiring juniors today.
Does speed matter when it's not getting better and learning from corrections? I think I'd rather give someone a problem and have them come back with something that works in a couple days (answering a question here or there), rather than spend my time doing it myself because I'm getting fast, but wrong, results that aren't improving from the AI.
> though I sometimes wonder where we’ll find mid-level and senior developers tomorrow if we stop hiring juniors today.
This is also a key point, even though there is a lot of short-term thinking these days, since people don't stick with companies like they used to. As a person who has been with my company for close to 20 years, making sure things can still run once you leave is important from a business perspective.
Training isn't about today, it's about tomorrow. I've trained a lot of people, and doing it myself would always be faster in the moment. But it's about making the team better and making sure more people have more skill, to reduce single points of failure and ensure business continuity over the long-term. Not all of it pays off, but when it does, it pays off big.
Years of experience doesn't correlate to a good developer either. I've seen senior devs using AI to solve impossible problems, for example asking it how to store an API key client side without leaking it...
I guess this goes to show, nobody really has a moat in this game so far. Everyone is sprinting like crazy but I don't see anyone really gaining a sustainable edge that will push out competitors.
This kind of thing is why I think Sam is often misjudged. You can’t fuck around in such a competitive market. If you go in all kumbaya you’ll get crushed by market forces. It’s rare for company/founder ideals to survive the market indefinitely. I think he’s iterated fast and the job is still very hard.
History has shown being first to market isn't all it's cracked up to be. You spend more, it's more difficult creating the trail others will follow, you end up with a tech stack that was built before tools and patterns stabilized, and you've created a giant superhighway for a fast follower. Anyone remember MapQuest, AltaVista or Hotmail?
OpenAI has some very serious competition now. When you combine that with the recent destabilizing saga they went through along with commoditization of models with services like OpenRouter.ai, I'm not sure their future is as bright as their recent valuation indicates.
They seem to be going after different markets, or at least having differing degrees of success in going after different markets.
OpenAI is most successful with consumer chat app (ChatGPT) market.
Anthropic is most successful with business API market.
OpenAI currently has a lot more revenue than Anthropic, but it's mostly from ChatGPT. For API use, the revenue numbers of both companies are roughly the same. API success seems more important than chat apps since it will scale with the success of the users' businesses, and this is really where the dream of an explosion in AI profits comes from.
ChatGPT's user base size vs that of Claude's app may be first mover advantage, or just brand recognition. I use Claude (both web based and iOS app), but still couldn't tell you if the chat product even has a name distinct from the model. How's that for poor branding?! OpenAI have put a lot of effort into the "her" voice interface, while Anthropic's app improvements are more business orientated in terms of artifacts (which OpenAI have now copied) and now code execution.
Just wanted to add a note to this. Tool calling - particularly to source external current data - is something that's had the big foundational LLM providers very nervous so they've held back on it, even though it's trivial to implement at this point. But we're seeing it rapidly emerge with third party providers who use the foundational APIs. Holding back tool calling has limited the complex graph-like execution flows that the big providers could have implemented on their user facing apps e.g. the kind of thing that Perplexity Pro has implemented. So they've fallen behind a bit. They may catch up. If they don't they risk becoming just an API provider.
Most people's first exposure to LLMs was ChatGPT, and that was only what - like 18 months ago it really took off in the mainstream? We're still very early on in the grand scheme of things.
Yes it's silly to talk about first mover advantage in sub 3 years. Maybe in 2026 we can revisit this question and see if being the first mattered.
First mover being a general myth doesn't mean being the first to launch and then immediately dominating the wider market for a long period is impossible. It's just usually means their advantage was about a lot more than simply being first.
Claude is only better in some cherry picked standard eval benchmarks, which are becoming more useless every month due to the likelihood of these tests leaking into training data. If you look at the Chatbot Arena rankings where actual users blindly select the best answer from a random choice of models, the top 3 models are all from OpenAI. And the next best ones are from Google and X.
I'm subscribed to all of Claude, Gemini, and ChatGPT. Benchmarks aside, my go-to is always Claude. Subjectively speaking, it consistently gives better results than anything else out there. The only reason I keep the other subscriptions is to check in on them occasionally to see if they've improved.
I don't pay any attention to leaderboards. I pay for both Claude and ChatGPT and use them both daily for anything from Python coding to the most random questions I can think of. In my experience Claude is better (much better) that ChatGPT in almost all use cases. Where ChatGPT shines is the voice assistant - it still feels almost magical having a "human-like" conversation with the AI agent.
Anecdotally, I disagree. Since the release of the "new" 3.5 Sonnet, it has given me consistently better results than Copilot based on GPT-4o.
I've been using LLMs as my rubber duck when I get stuck debugging something and have exhausted my standard avenues. GPT-4o tends to give me very general advice that I have almost always already tried or considered, while Claude is happy to say "this snippet looks potentially incorrect; please verify XYZ" and it has gotten me back on track in maybe 4/5 cases.
Bullshit. Claude 3.5 Sonnet owns the competition according to the most useful benchmark: operating a robot body in the real world. No other model comes close.
This seems incorrect. I don't need Claude 3.5 Sonnet to operate a robot body for me, and don't know anyone else who does. And general-purpose robotics is not going to be the most efficient way to have robots do many tasks ever, and certainly not in the short term.
Of course not but the task requires excellent image understanding, large context window, a mix of structured and unstructured output, high level and spatial reasoning, and a conversational layer on top.
I find it’s predictive of relative performance in other tasks I use LLMs for. Claude is the best. The only shortcoming is its peculiar verbosity.
Definitely superior to anything OpenAI has and miles beyond the “open weights” alternatives like Llama.
The problem is that it also fails on fairly simple logic puzzles that ChatGPT can do just fine.
For example, even the new 3.5 Sonnet can't solve this reliably:
> Doom Slayer needs to teleport from Phobos to Deimos. He has his pet bunny, his pet cacodemon, and a UAC scientist who tagged along. The Doom Slayer can only teleport with one of them at a time. But if he leaves the bunny and the cacodemon together alone, the bunny will eat the cacodemon. And if he leaves the cacodemon and the scientist alone, the cacodemon will eat the scientist. How should the Doom Slayer get himself and all his companions safely to Deimos?
In fact, not only is its solution wrong, but it can't figure out why it's wrong on its own if you ask it to self-check.
In contrast, GPT-4o always consistently gives the correct response.
This brings up the broader question: why are AI companies so bad at naming their products?
All the OpenAI model names look like garbled nonsense to the layperson, while Anthropic is a bit of a mixed bag too. I'm not sure what image Claude is supposed to conjure, Sonnet is a nice name if it's packaged as a creative writing tool but less so for developers. Meta AI is at least to the point, though not particularly interesting as far as names go.
Gemini is kind of cool sounding, aiming for the associations of playful/curious of that zodiac sign. And the Gemini models are about as unreliable as astrology is for practical use, so I guess that name makes the most sense.
Yes, muscle memory is powerful. But it's not an insurmountable barrier for a follower. The switch from Google to various AI apps like Perplexity being a case in point. I still find myself beginning to reach for Google and then 0.1 seconds later catching myself. As a side note: I'm also catching myself having a lack of imagination when it comes to what is solvable. e.g. I had a specific technical question about github's UX and how to get to a thing that no one would have written about and thus Google wouldn't know, but openAI chat nailed it first try.
Honestly I think the biggest reason for this is that Claude requires you to login via an email link whereas OpenAI will let you just login with any credentials.
This matters if you have a corporate machine and can't access your personal email to login.
Thank you, people, for contributing to this free software ecosystem. Oh, you can't monetize your work? Your problem, not ours! Deals are made, but for you, who provide the free code, we have zero monetization options on our GitHub platform. Go pay for Copilot, which was trained on your data.
I mean, this is the worst farce ever concocted. And people are oblivious to what's happening...
We are not oblivious. We are powerless. Oracle could go toe to toe with Google and threaten multibillion-dollar fines over basically an API and 11 kLOC. As an open source developer, there is no way to match that.
So GitHub's teaming up with Google, Anthropic, and OpenAI? Kinda feels like Microsoft's version of a 'safety net', but for whom exactly? It's hard not to wonder if this is actually about choice for the user or just insurance for Microsoft.
Frankly surprised to see GitHub (Microsoft) signing a deal with their biggest competitor, Google. It does give Microsoft some good terms/pricing leverage over OpenAI, though I'm not sure to what degree Microsoft needs that given their investment in OpenAI.
GitHub Spark seems like the most interesting part of the announcement.
If you want to destroy open source completely, the more models the better. Microsoft's co-opting and infiltration of OSS projects will serve as a textbook example of eliminating competition in MBA programs.
And people still support it by uploading to GitHub.
> This is performed with no acknowledgement of authorship or lineage, with no attribution or citation.
GitHub hosts a lot of source code, including presumably the code it trained CoPilot on. So they satisfy any license that requires sharing the code and license, such as GPL 3. Not sure what the problem is.
> The social rewards (e.g., credit, respect) that often motivate open source work are undermined.
You mean people making contributions to solve problems and scratch each others' itches got displaced by people seeking social status and/or a do-at-your-own-pace accreditation outside of formal structures, to show to prospective employers? And now that LLMs start letting people solve their own coding problems, sidestepping their whole social game, the credit seekers complain because large corps did something they couldn't possibly have done?
I mean sure, their contributions were a critical piece - in aggregate - individually, any single piece of OSS code contributes approximately 0 value to LLM training. But they're somehow entitled to the reward for a vastly greater value someone is providing, just because they retroactively feel they contributed.
Or, looking from a different angle: what the complainers are saying is, they're sad they can't extract rent now that their past work became valuable for reasons they had no part in, and if they could turn back time, they'd happily rent-seek the shit out of their code, to the point of destroying LLMs as a possibility, and denying the world the value LLMs provided?
I have little sympathy for that argument. We've been calling out "copyright laundering" way before GPT-3 was a thing - those who don't like to contribute without capturing all the value for themselves should've moved off GitHub years ago. It's not like GitHub has any hold over OSS other than plain inertia (and the egos in the community - social signalling games create a network effect).
> individually, any single piece of OSS code contributes approximately 0 value to LLM training. But they're somehow entitled to the reward for a vastly greater value someone is providing, just because they retroactively feel they contributed.
You are attributing arguments to people which they never made. The most lenient of open source licenses require a simple citation, which the "A.I." never provides. Your tone comes off as pretty condescending, in my opinion. My summary of what you wrote: "I know they violated your license, but too bad! You're not as important as you think!"
>Or, looking from a different angle: what the complainers are saying is, they're sad they can't extract rent now that their past work became valuable for reasons they had no part in, and if they could turn back time, they'd happily rent-seek the shit out of their code,
Wrong and completely unfair/bitter accusation. The only people rent seeking are the corporations.
What kind of world do you want to live in? The one with "social games" or the one with corporate games? The one with corporate games seems to have less and less room for artists, musicians, language graduates, programmers...
I deleted my GitHub two weeks ago, as much about AI as about them forcing 2FA. Before AI it was SaaS taking more than it was giving. I miss the 'helping each other' feel of these code-sharing sites. I wonder where we are heading with all this. All competition and no collaboration; no wonder the planet is burning.
Migration is on my todo list, but it’s non trivial enough I’m not sure when I’ll ever have cycles to even figure out the best option. Gitlab? Self-hosted Git? Go back to SVN? A totally different platform?
Truth be told, Git is a major pain in the ass anyway and I’m very open to something else.
I think in this regard it works just fine. If the laws move to say that "learning from data" while not reproducing it is "stealing", then yes, you reading others' code and learning from it is also stealing.
If I can't feed a news article into a classifier to teach it to learn whether or not I would like that article, that's not a world I want to live in. And yes, it's exactly the same thing as what you are accusing LLMs of.
They should be subject to laws the same way humans are. If they substantially reproduce code they had access to then it's a copyright violation. Just like it would be for a human doing the same. But highly derived code is not "stolen" code, neither for AI nor for humans.
Me teaching my brain someone’s way of syntactically expressing procedures is analogous to AI developers teaching their model that same mode of expression.
To me, the argument is a LLM learning from GPL stuff == creating a derivative of the GPL code, just "compressed" within the LLM. The LLM then goes on to create more derivatives, or it's being distributed (with the embedded GPL code).
Yes, I provide it as a service to my employer. It's called a job. Guess what? When I read code I learn from it and my brain doesn't care what license that code is under.
It's not your reading that would be illegal, but your copying. This is a well-documented area of the law, and there are concrete answers to your questions.
It'll be okay. We "destroyed" photography by uploading to places like Instagram and Facebook but photography as a whole is still alive. It turns out even though there is lots of stealing, the world still spins and people still seek out original creators.
I don't understand the case being made here at all. AI is violating FOSS licenses, I totally agree. But you can write more FOSS using AI. It's totally unfair, because these companies are not sharing their source, and extracting all of the value from FOSS as they can. Fine. But when it comes to OSI Open Source, all they usually had to do was include a text file somewhere mentioning that they used it in order to do the same thing, and when it comes to Free Software, they could just lie about stealing it and/or fly under the radar.
Free software needs more user-facing software, and it needs people other than coders to drive development (think UI people, subject matter specialists, etc.), and AI will help that. While I think what the AI companies are doing is tortious, and that they either should be stopped from doing it or the entire idea of software copyright should be re-examined, I also think that AI will be massively beneficial for Free Software.
I also suspect that this could result in a grand bargain in some court (which favors the billionaires of course) where the AI companies have to pay into a fund of some sort that will be used to pay for FOSS to be created and maintained.
Lastly, maybe Free Software developers should start zipping up all of the OSI licenses that only require that a license be included in the distribution and including that zipfile with their software written in collaboration with AI copilots. That and your latest GPL for the rest (and for your own code) puts you in as safe a place as you could possibly be legally. You'll still be hit by all of the "don't do evil"-style FOSS-esque licenses out there, but you'll at least be safer than all of the proprietary software being written with AI.
I don't know what textbook directs you to eliminate all of your competition by lowering your competition's costs, narrowing your moat of expertise, and not even owning a piece of that.
edit: that being said, I'm obviously talking about Free Software here, and not Open Source. Wasn't Open Source only protected by spirits anyway?
This is a standard “commoditize your complement” play. It’s in GitHub / Microsoft’s best interest to make sure none of the LLMs become dominant.
As long as that happens, their competitors light money on fire to build the model while GitHub continues to build / defend its monopoly position.
Also, given that there are already multiple companies building decent models, it’s a pretty safe bet Microsoft could build their own in a year or two if the market starts settling on one that’s a strategic threat.
See also: “embrace, extend, extinguish” from the 1990’s Microsoft antitrust days.
You mean waste a few billion on buying a company that couldn't compete with the market anymore because the iphone made "even an idiot should be able to use this thing, and it should be able to do pretty much everything" a baseline expectation with an OS/software experience to match? Nokia failed Nokia, and then Microsoft gave it a shot. And they also couldn't make it work.
(sure, that glosses over the whole Elop saga, but Microsoft didn't buy a Nokia-in-its-prime and killed it. They bought an already failing business and even throwing MS levels of resources at it couldn't turn it around)
> “The size of the Lego blocks that Copilot on AI can generate has grown [...] It certainly cannot write a whole GitHub or a whole Facebook, but the size of the building blocks will increase”
Um, that would make it less capable, not more... /thatguy
Yet another confirmation that AI models are nothing but commodities.
There's no moat, none.
I'm really curious how any company building models can hope to see any meaningful return on their billion-dollar investments, when a few people leaving and getting enough Azure credits can create a competitor in a few months.
I think Bloomberg's at fault: "cut a deal" isn't usually that ambiguous, because it's clear which state transition is more likely. But here it's plausible they could've been ending some existing training-data-sharing agreement, or that they were making a new, different deal. Also, the fact that it's pluralised here makes it different enough from the most common form for the idiom to be a bit harder to notice. But since we can't change the fact they used that title, I would like HN to change it now.
That’s a strange usage of the word “cuts”. I thought GitHub terminated the deals with Google and Anthropic. It would be better if the title were GitHub signs AI deals instead of cuts.
That's correct. I'm not a native speaker, and I am not well versed in slang. I am sometimes embarrassed because I speak as if using words from a book instead of sounding like spoken language. Do you know how "cuts" came to mean making a deal? For a non-native speaker it means the exact opposite, as in "he cut a wire". Language evolves in strange ways.
"Cut a deal" is an idiom, not slang: it's appropriate language to use in a business context, for example.
The origin is hazy, of the theories I've seen I consider this the best one: "deal" means both "an agreement" and "to distribute cards in a card game". The dealer, in the latter sense, first cuts the card deck then deals the card. "Cut and deal" -> "cut a deal".
It could also be related to "cut a check", which comes from an era before perforated paper was widespread, when one would literally cut the check out of a book of checks.
As an aside, "closing" and "concluding" a deal or sale also usually mean to successfully reach an agreement. It's more of a semantic quirk around deals than an isolated idiom.
But when you make it "cut AI deal", that breaks the standard phrase and opens the door to alternative explanations. I initially thought this was a news article about the deal breaking up.
The reason here is Microsoft is trying to make copilot a platform.
This is the essential step to moving all the power from OpenAI to Microsoft. It would grant Microsoft leverage over all providers since the customers would depend on Microsoft and not OpenAI or Google or Anthropic. Classic platform business evolution at play here.
I'm sure there are multiple reasons, including lowering the odds of antitrust action by regulators. The EU was already sniffing around Microsoft's relationship with OpenAI.
The Copilot team probably thinks of Cursor's efforts as cute. They can be a neat little product in their tiny corner of the market.
It's far more valuable to be a platform. Maybe Cursor can become a platform, but the race is on and they're up against giants that are moving rather surprisingly nimbly.
Github does way more, you can build on top of it, and they already have a metric ton of business relationships and enterprise customers.
A developer will spend far more time in the IDE than the version control system so I wouldn't discount it that easily. That being said, there are no network effects for an IDE and Cursor is basically just a VSCode plugin. Maybe Cursor gets a nice acquihire deal
"Cuts " ... leads to the initial parsing of "cuts all ties with" or similar "severs relationship with".
When with additional modifiers between "cuts" and "deal" the "cuts deal with" becomes harder to recognize as the "forms a deal with" meaning of the phase.
Yeah, I was expecting outrage when I first clicked into the thread to glance at the comments, and then I was like "wait, why are people saying it's exciting?"
Doesn’t sound natural to me, and I couldn’t find any examples online using that phrasing to mean someone was removed from a deal. You can be cut from a team, though.
I don’t like using AI assistants in my editor; I prefer to keep it as clean as possible. So, I manually copy relevant parts of the code into ChatGPT, ask my question, and continue interacting until I get what I need. It’s a bit manual, but since I use GPT for other tasks, it’s convenient to have a single interface for everything.
You mean "Microsoft" cuts deals with Google and Anthropic on top of their already existing deals with Mistral, Inflection whilst also having an exclusivity deal with OpenAI?
This is an extend to extinguish round 4 [0], whilst racing everyone else to zero.
I replaced ChatGPT Plus with hosted nvidia/Llama-3.1-Nemotron-70B-Instruct for coding tasks. Nemotron produces good code. The cost difference is massive: Nemotron is available for $0.35 per Mtoken in and out, while ChatGPT is considerably more expensive.
https://archive.is/Il4QM
If you actually understood Russian and read the text, you could uncover much deeper and subtle meaning and connections that get lost in translation.
If you went to russia today you could get around with google translate and people would understand you. But you aren't going to be having anything other than surface level requests and responses.
Coding with LLMs reminds me a lot of this. Yes, they produce something that the computer understands and runs, but the meaning and intention of what you wanted to communicate gets lost through this translation layer.
Coding is even worse because I feel like the intention of coding should never be to belt out as many lines as possible. Coding has powerful abstractions that you can use to minimize the lines you write and crystallize meaning and intent.
> the intention of coding should never to be to belt out as many lines as possible
That's such an underrated statement. Especially when you consider that the amount of code is a liability you'll have to take care of later.
This presumes that it will be real humans that have to “take care” of the code later.
A lot of the people that are hawking AI, especially in management, are chasing a future where there are no humans, because AI writes the code and maintains the code, no pesky expensive humans needed. And AI won’t object to things like bad code style or low quality code.
I think this is a bit short sighted, but I’m not sure how short. I suspect in the future, code will be something in between what it is today, and a build artifact. Do you have to maintain bytecode?
I've heard a similar sentiment: "It's not lines of code written, it's lines of code spent."
It also reminds me of this analogy for data, especially sensitive data: "it's not oil, it's nuclear waste."
I had the opposite experience lately:
I was helping translate some UI text for a website from English to German, my mother tongue. I found that usually the machine came up with better translations than me.
Perhaps you are not a translator. Translating is a skill that is more than simply being bilingual.
I am a professional translator, and I have been using LLMs to speed up and, yes, improve my translations for a year and a half.
When properly prompted, the LLMs produce reasonably accurate and natural translations, but sometimes there are mistakes (often the result of ambiguities in the source text) or the sentences don’t flow together as smoothly as I would like. So I check and polish the translations sentence by sentence. While I’m doing that, I sometimes encounter a word or phrase that just doesn’t sound right to me but that I can’t think how to fix. In those cases, I give the LLMs the original and draft translation and ask for ten variations of the problematic sentence. Most of the suggestions wouldn’t work well, but there are usually two or three that I like and that are better than what I could come up with on my own.
Lately I have also been using LLMs as editors: I feed one the entire source text and the draft translation, and I ask for suggestions for corrections and improvements to the translation. I adopt the suggestions I like, and then I run the revised translation through another LLM with the same prompt. After five or six iterations, I do a final read-through of the translation to make sure everything is okay.
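To make the "ten variations" step concrete, here is a minimal sketch using the OpenAI Python SDK; the model name, prompt wording, and placeholder strings are mine, not necessarily what I actually use:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    source = "..."  # the original source-language sentence (placeholder)
    draft = "..."   # my draft English translation of it (placeholder)

    prompt = (
        "Here is a sentence and my draft English translation of it.\n"
        f"Source: {source}\n"
        f"Draft: {draft}\n"
        "Give me ten alternative renderings of the draft sentence that keep the "
        "meaning of the source but vary the phrasing and rhythm."
    )

    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)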
My guess is that using LLMs like this cuts my total translation time by close to half while raising the quality of the finished product by some significant but difficult-to-quantify amount.
This process became feasible only after ChatGPT, Claude, and Gemini got longer context windows. Each new model release has performed better than the previous one, too. I’ve also tried open-weight models, but they were significantly worse for Japanese to English, the direction I translate.
Although I am not a software developer, I’ve been following the debates on HN about whether or not LLMs are useful as coding assistants with much interest. My guess is that the disagreements are due partly to the different work situations of the people on both sides of the issue. But I also wonder if some of those who reject AI assistance just haven’t been able to find a suitable interactive workflow for using it.
This is a great analogy. I find myself thinking that by abstracting the entire design process when coding something using generative AI tools, you tend to lose track of fine details by only concentrating on the overall function.
Maybe the code works, but does it integrate well with the rest of the codebase? Do the data structures that it created follow the overall design principles for your application? For example, does it make the right tradeoffs between time and space complexity for this application? For certain applications, memory may be an issue, and while the code may work, it uses too much memory to be useful in practice.
These are the kind of problems that I think about, and it aligns with your analogy. There is in fact something "lost through this translation layer".
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible
I heard a similar thing from a dude when I said I use it for bash scripts instead of copying and pasting things off StackOverflow.
He was a bit "get off my lawny" about the idea of running any code you didn't write, especially bash scripts in a terminal.
It is obviously the case that I didn't write most of the code in the world, by a very large margin, but even without taking it to extremes: if I'm working on a team and other people are writing code, how is it any different? Everyone makes mistakes; I make mistakes.
I think it's a bad idea to run things when you don't at least understand what they're going to do, but the speed with which ChatGPT can produce, for example, gcloud shell commands to manage resources is lightning fast (all of which is very readable; it just takes a while if you want to look it up and compose the commands yourself).
If your quality control method is "making sure there are no mistakes" then it's already broken regardless of where the code comes from. Me reviewing AI code is no different from me reviewing anyone else's code.
Me testing AI code using unit or integration tests is no different from testing anyone else's code, or my own code for that matter.
Multiple times in my s/w development career, I've had supervisors ask me why I am not typing code throughout the work day.
My response each time was along the lines of:
> Me reviewing AI code is no different from me reviewing anyone else's code.
I take your point, and on the whole I agree with your post, but this point is fundamentally _not_ correct, in that if I have a question about someone else's code I can ask them about their intention, state-of-mind, and understanding at the time they wrote it, and (subjectively, sure; but I think this is a reasonable claim) can _usually_ detect pretty well if they are bullshitting me when they respond. Asking AI for explanations tends to lead to extremely convincing and confident false justifications rather than an admission of error or doubt.
However:
> Me testing AI code using unit or integration tests is no different from testing anyone else's code, or my own code for that matter.
This is totally fair.
I think it depends on the stakes of what you're building.
A lot of the concerns you describe make me think you work in a larger company or team and so both the organizational stakes (maintenance, future changes, tech debt, other people taking it over) and the functional stakes (bug free, performant, secure, etc) are high?
If the person you're responding to is cranking out a personal SaaS project or something they won't ever want to maintain much, then they can do different math on risks.
And probably also the language you're using, and the actual code itself.
Porting a multi-thousand line web SaaS product in Typescript that's just CRUD operations and cranking out web views? Sure why not.
Porting a multi-thousand line game codebase that's performance-critical and written in C++? Probably not.
That said, I am super fascinated by the approach of "let the LLM write the code and coach it when it gets it wrong" and I feel like I want to try that.. But probably not on a work project, and maybe just on a personal project.
> Porting a multi-thousand line web SaaS product in Typescript that's just CRUD operations and cranking out web views? Sure why not.
>
> Porting a multi-thousand line game codebase that's performance-critical and written in C++? Probably not.
From my own experience:
I really enjoy CoPilot supporting me while writing a terraform provider. I think this works well because we have hundreds of existing terraform providers with the same boilerplate and the same REST handling already. Here, the LLM can crank out oodles and oodles of identical boilerplate that's easy to review and deal with. Huge productivity boost. Maybe we should have better frameworks and languages for this, but alas...
I've also tried using CoPilot on a personal Godot project. I turned it off after a day, because it was so distracting with nonsense. Thinking about it along these lines, I would not be surprised if this occurred because the high-level code of games (think what AAA games do in Lua, and what Godot does in GDScript) tends to be small-volume and rather erratic. There is no real pattern to follow.
This could also be a cause for the huge difference in LLM productivity boosts people report. If you need Spring Boot code to put query params into an ORM and turn that into JSON, it can probably do that. If you need embedded C code for an obscure micro controller.. yeah, good luck.
> If you need embedded C code for an obscure micro controller.. yeah, good luck.
... or even information in the embedded world. LLMs need to generate something, so they'll generate code even when the answer is "no dude, your chip doesn't support that".
> they'll generate code even when the answer is "no dude, your chip doesn't support that".
This is precisely the problem. As I point out elsewhere[0], reviewing AI-generated code is _not_ the same thing as reviewing code written by someone else, because you can ask a human author what they were thinking and get a moderately-honest response; whereas an AI will confidently and convincingly lie to you.
[0] https://news.ycombinator.com/item?id=41991750
I am quite interested in how LLMs would handle game development. Coming to game development from a long career in boutique applications and also enterprise software, game development is a whole different level of "boutique".
I think both because of the coupled, convoluted complexity of much game logic, and because there are fewer open source examples of novel game code available to train on, they may struggle to be as useful.
> A lot of the concerns you describe make me think you work in a larger company or team and so both the organizational stakes (maintenance, future changes, tech debt, other people taking it over) and the functional stakes (bug free, performant, secure, etc) are high?
The most financially rewarding project I worked on started out as an early stage startup with small ambitions. It ended up growing and succeeding far beyond expectations.
It was a small codebase but the stakes were still very high. We were all pretty experienced going into it so we each had preferences for which footguns to avoid. For example we shied away from ORMs because they're the kind of dependency that could get you stuck in mud. Pick a "bad" ORM, spend months piling code on top of it, and then find out that you're spending more time fighting it than being productive. But now you don't have the time to untangle yourself from that dependency. Worst of all, at least in our experience, it's impossible to really predict how likely you are to get "stuck" this way with a large dependency. So the judgement call was to avoid major dependencies like this unless we absolutely had to.
I attribute the success of our project to literally thousands of minor and major decisions like that one.
To me almost all software is high stakes. Unless it's so trivial that nothing about it matters at all; but that's not what these AI tools are marketing toward, are they?
Something might start out as a small useful library and grow into a dependency that hundreds of thousands of people use.
So that's why it terrifies me. I'm terrified of one day joining a team or wanting to contribute to an OSS project - only to be faced with thousands of lines of nonsensical autogenerated LLM code. If nothing else it takes all the joy out of programming computers (although I think there's a more existential risk here). If it was a team I'd probably just quit on the spot but I have that luxury and probably would have caught it during due diligence. If it's an OSS project I'd nope out and not contribute.
Adding here due to some resonance with this point of view: this exchange lacks a crucial axis, namely what kind of programming?
I assume the parent post is saying "I ported thousands of lines of <some C family executing on a server> to <python on standard cloud environments>." I could be very wrong, but that is my guess. Like any data-driven software machinery, there is massive inherent bias and extra resources for <current in-demand thing>; in this guessed case, it is Python running on a standard cloud environment, perhaps with the loaders and credentials parts too.
Those who learned programming in the theoretical ways know that many, many software systems are possible in various compute contexts. And those working on hardware teams know that there are a lot of kinds of computing hardware. And to add another off-the-cuff idea: there is so much web interface code à la 2004 to bring to newer, cleaner setups.
I am not <emotional state descriptor> about this sea change in code generation; code generation is not at all new. It is the blatant stealing and LICENSE-washing of a generation of OSS that gets me. Those code generation machines are repeating their inputs. No authors agreed, and no one asked them, either.
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible.
I do think it is kind of crazy based on what I've seen. I'm convinced LLM is a game changer but I couldn't believe how stupid it can be. Take the following example, which is a spelling and grammar checker that I wrote:
https://app.gitsense.com/?doc=f7419bfb27c8968bae&samples=5
If you click on the sentence, you can see that Claude-3.5 and GPT-4o cannot tell that GitHub is spelled correctly most of the time. It was this example that made me realize how dangerous LLMs can be. The sentence is short, but Claude-3.5 and GPT-4o just can't process it properly.
Having an LLM rewrite large swaths of code is crazy, but I believe that with proper tooling to verify and challenge changes, we can mitigate the risk.
I'm just speculating, but I believe GitHub has come to the same conclusion that I have, which is, all models can be stupid, but it is unlikely that all will be stupid at the same time.
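That "unlikely to all be stupid at the same time" idea is cheap to prototype. A minimal sketch (assuming the official openai and anthropic Python SDKs; the model names and the yes/no question are placeholders) that asks two models the same question and flags any disagreement for human review:

    from openai import OpenAI
    from anthropic import Anthropic

    QUESTION = (
        'Is "GitHub" spelled correctly in this sentence: '
        '"I pushed the repo to GitHub."? Answer YES or NO.'
    )  # placeholder check

    def ask_gpt(question):
        client = OpenAI()  # assumes OPENAI_API_KEY in the environment
        r = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content": question}],
        )
        return r.choices[0].message.content.strip()

    def ask_claude(question):
        client = Anthropic()  # assumes ANTHROPIC_API_KEY in the environment
        r = client.messages.create(
            model="claude-3-5-sonnet-latest",  # placeholder model name
            max_tokens=32,
            messages=[{"role": "user", "content": question}],
        )
        return r.content[0].text.strip()

    answers = {"gpt": ask_gpt(QUESTION), "claude": ask_claude(QUESTION)}
    if len(set(answers.values())) > 1:
        print("models disagree, flag for human review:", answers)
    else:
        print("models agree:", answers)

The interesting signal isn't which model is right, it's the disagreement itself - that's where a human should look.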
I'll take a stab at changing your mind.
AIs are not able to write Redis. That's not their job. AIs should not write complex high performance code that millions of users rely on. If the code does something valuable for a large number of people you can afford humans to write it.
AIs should write low value code that just repeats what's been done before but with some variations. Generic parts of CRUD apps, some fraction of typical frontends, common CI setups. That's what they're good at because they've seen it a million times already. That category constitutes most code written.
This relieves human developers of ballpark 20% of their workload and that's already worth a lot of money.
> I'll take a stab at changing your mind.
Not the parent but this doesn’t seem mind changing, because what you describe is the normal/boring route to slightly better productivity using new tools without the breathless hype. And the 20% increase you mention of course depends a lot on what you’re doing, so for many types of work you’d be much closer to zero.
I'm curious about the claims of "power users" that are talking very excitedly about a brave new world. Are they fooling themselves, or trying to fool others, or working at jobs where 90% of their work is boilerplate drudgery, or what exactly? Inevitably it's all of the above, and some small percentage of real power users that could probably teach the rest of us cool stuff about their unique workflows. Not sure how to find the signal in all the noise though.
So personally, if I were to write “change my mind”, what I’d really mean is something like “convince me there are real power users already out there in the wild, using tools that are open to the public today”.
GP mentioned machine assisted translation of a huge code base being almost completely hands-off. If that were true and as easy as advertised then one might expect, for example, that it were trivial to just rewrite media wiki or Wordpress in rails or Django with a few people in a week. This is on the easier side of what I’d confidently label as a game-changingly huge productivity boost btw, and is a soft problem chosen because of the availability of existing code examples, mere translation over original work, etc. Not sure we’re there yet.
I can definitely see the value in letting AI generate low-stakes code. I'm a daily CoPilot user and, while I don't let it generate implementations, the suggestions it gives for boilerplate-y things are top notch. Love it as a tool.
My major issue with your position is that, at least in my experience, good software is the sum of even the seemingly low risk parts. When I think of real world software that people rely on (the only type I care about in this context) then it's hard to point a finger at some part of it and go "eh, this part doesn't matter". It all matters.
The alternative, I fear, is 90% of the software we use exhibiting subtle goofy behavior and just being overall unpleasant to use.
I guess an analogy for my concern is what it would look like if 60% of every film was AI generated using the models we have today. Some might argue that 60% of all films are low stakes scenes with simple exposition or whatever. And then remaining 40% are the climax or other important moments. But many people believe that 100% of the film matters - even the opening credits.
And even if none of that were an issue: in my experience it's very difficult to assess what part of an application will/won't be low/high stakes. Imagine being a tech startup that needs to pivot your focus toward the low stakes part of the application that the LLM wrote.
I think your concept of ‘what the AI wrote’ is too large. There is zero chance my one line copilot or three line cursor tab completions are going to have an effect on the overall quality of my codebase.
What it is useful for is doing exactly the things I already know need to happen, but don’t want to spend the effort to write out (at least, not having to do it is great).
Since my brain and focus aren’t killed by writing crud, I get to spend that on more useful stuff. If it doesn’t make me more effective, at least it makes my job more enjoyable.
I'm with you. I use Copilot every day in the way you're describing and I love it. The person I was responding to is claiming to code "hands off" and let the AI write the majority of the software.
> The alternative, I fear, is 90% of the software we use exhibiting subtle goofy behavior and just being overall unpleasant to use.
This sounds like most software honestly.
And that's what LLMs are trained on.
Hahaha
I have to disagree. If there’s that much boilerplate floating around then the tooling should be improved. Pasting over inefficiency with sloppier inefficiency is just a pure waste.
You still use type systems, tests, and code review.
For a lot of use cases it's powerful.
If you ask it to build out a brand new system with a complex algorithm or to perform a more complex refactoring, it'll be more work correcting it than doing it yourself.
But that malformed JSON document with the weird missing quotation marks (so the usual formatters break), and spaces before commas, and the indentation is wild... Give it to an LLM.
Or when you're writing content impls for a game based on a list of text descriptions, copy the text into a block comment. Then impl 1 example. Then just sit back and press tab and watch your profits.
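For the malformed-JSON case above, the division of labour can be pretty strict: let the real parser do the judging and only hand the mess to an LLM when it fails. A minimal sketch (the repair_with_llm callable is hypothetical; it would wrap whatever chat API you already use):

    import json

    def load_messy_json(text, repair_with_llm):
        """Parse JSON strictly; only fall back to an LLM repair when that fails.

        repair_with_llm is any callable taking a prompt string and returning the
        model's reply (hypothetical helper wrapping your chat API of choice).
        """
        try:
            return json.loads(text)
        except json.JSONDecodeError as err:
            prompt = (
                "Fix this malformed JSON so it parses. Keep every value exactly "
                f"as-is, only repair the syntax. Parser error: {err}\n\n{text}"
            )
            repaired = repair_with_llm(prompt)
            # Never trust the repair blindly: re-parse it with the strict parser.
            return json.loads(repaired)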
The (mostly useless boilerplate “I’m basically just testing my mocks”) tests are being written by AI too these days.
Which is mildly annoying as a lot of those tests are basically just noise rather than useful tools. Humans have the same problem, but current models are especially prone to it from what I’ve observed
And not enough devs are babysitting the AI to make sure the test cases are useful, even if they’re doing so for the original code it produced
There are very few tutorials on how to do testing, and I don't think I have ever seen one that was great, compared to general coding stuff where there are great tutorials available for all the most common things.
So I think quality testing is just not in the training data at anywhere close to the quantity needed.
> Whenever I sit down to write some code, be it a large implementation or a small function, I think about what other people (or future versions of myself) will struggle with when interacting with the code. Is it clear and concise? Is it too clever? Is it too easy to write a subtle bug when making changes? Have I made it totally clear that X is relying on Y dangerous behavior by adding a comment or intentionally making it visible in some other way?
> It goes the other way too. If I know someone well (or their style) then it makes evaluating their code easier. The more time I spend in a codebase the better idea I have of what the writer was trying to do.
What I believe you are describing is a general definition of "understanding", which I am sure you are aware. And given your 20+ year experience, your summary of:
> So the thought of opening up a codebase that was cobbled together by an AI is just scary to me. Subtle bugs and errors would be equally distributed across the whole thing instead of where the writer was less competent (as is often the case).
Is not only entirely understandable (pardon the pun), but to be expected, as the algorithms employed lack the crucial bit which you identify: understanding.
> The whole thing just sounds like a gargantuan mess.
As it does to most who envision having to live with artifacts produced by a statistical predictive-text algorithm.
> Change my mind.
One cannot, because understanding, as people know it, is intrinsic to each person by definition. It exists as a concept within the person who possesses it and is defined entirely by said person.
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible.
I think this is where the bimodality comes from. When someone says "I used AI to refactor 3000 loc", some take it to mean they used AI in small steps as an accelerator, and others take it to mean a direct copy/paste, fix compile errors, and move on.
Treat AI like a mid-level engineer that you are pair programming with, who can type insanely fast. Move in small steps. Read through its code after each small iteration. Ask it to fix things (or fix them yourself if quick and easy). Brainstorm ideas with it, etc.
It’s really far from mid level. It’s a weird mix of expert at things it trained on, and complete misleading idiot at anything outside.
For a bash script or the first steps of something simple it’s great.
For anything complex at all it’s worse than nothing.
For anything complex, move in small steps.
For anything truly novel, or on a codebase with a very bespoke in house architecture or DSL, yeah you won't get much out of it.
> Change my mind.
Unit, integration, e2e, types and linters would catch most of the things you mention.
Not every piece of software is mission critical; often the most important thing is to go as fast as possible and iterate very quickly. Good enough is better than very good in many cases.
> Unit, integration, e2e, types and linters would catch most of the things you mention.
Who’s writing those?
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible.
Why? Presumably you let your coworkers move code around, too, and then you review it? (And vice versa.)
I have 10 years of professional experience and I've been writing code for 20 years. Really, with this workflow I just read and review significantly more code, and I coach it when it structures or styles something in a way I don't like.
I'm fully in control, and nothing gets committed that I haven't read; it's an extension of me at that point.
Edit: I think the issues you've mentioned typically apply to people too and the answer is largely the same. Talk, coach, put hard fixes in like linting and review approvals.
> Talk, coach, put hard fixes in like linting and review approvals.
And sometimes, when all that doesn’t work? Just do it yourself :)
> As a programmer of over 20 years - this is terrifying.
>
> I'm willing to accept that I just have "get off my lawn" syndrome or something.
>
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible.
My first thought was that I disagree (though I don't use or like this in-IDE AI stuff) because of version control. But then the way people use (or can't use) VCS 'terrifies' me anyway, so maybe I agree? It would be fine correctly handled, but it won't be, sort of thing.
The saying "You can delegate tasks but not responsibility" comes to mind.
You are still responsible for the code AI is writing. It is just that writing code with AI is more like reviewing a PR now.
> But the idea of letting an LLM write/move large swaths of code seems so incredibly irresponsible.
People felt the same about compilers for a long time. And justifiably so: the idea that compilers are reliable is quite a new one; finding compiler bugs used to be pretty common. (Those experimenting with newer languages still get to enjoy the fun of this!)
How about other code generation tools? Presumably you don't take much umbrage at schema generators? Or code generators that take a schema and output library code (OpenAPI, Protocol Buffers, or even COM)? Those can easily take a few dozen lines of input and output many thousands of LoC, and because they are part of an automated pipeline, even if you do want to fix the code up, any fixes you make will be destroyed on the next pipeline run!
But there is also a LOT of boring boilerplate code that can be automated.
For example, the necessary code to create a new server, attach a JSON schema to a POST endpoint, validate a bearer token, and enable a given CORS config is pretty cut and dry.
If I am ramping up on a new backend framework, I can either spend hours learning the above and then copy and paste it forever more into each new project I start up, or I can use an AI to crap the code out for me.
(Actually once I was setting up a new server and I decided to not just copy and paste and to do it myself, I flipped the order of two `use` directives and it cost me at least 4 hours to figure out WTF was wrong....)
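To show how cut and dry that server boilerplate is, here's a minimal sketch of the kind of thing I mean, using FastAPI (my choice of framework here; the token, origins, and schema fields are placeholders):

    from fastapi import Depends, FastAPI, Header, HTTPException
    from fastapi.middleware.cors import CORSMiddleware
    from pydantic import BaseModel

    EXPECTED_TOKEN = "change-me"  # placeholder; load from config/secrets in real code

    app = FastAPI()
    app.add_middleware(
        CORSMiddleware,
        allow_origins=["https://example.com"],  # placeholder CORS config
        allow_methods=["POST"],
        allow_headers=["Authorization", "Content-Type"],
    )

    class Item(BaseModel):
        # schema attached to the POST body (placeholder fields)
        name: str
        quantity: int

    def require_bearer(authorization: str = Header(...)):
        # reject requests that don't carry the expected bearer token
        if authorization != f"Bearer {EXPECTED_TOKEN}":
            raise HTTPException(status_code=401, detail="invalid token")

    @app.post("/items")
    def create_item(item: Item, _: None = Depends(require_bearer)):
        return {"ok": True, "item": item}

None of this is clever, which is exactly why it's the kind of code I'm happy to have generated and then review.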
> As a programmer of over 20 years
I'm almost up there, and my view is that I have two modes of working:
1. Super low level, where my intimate knowledge of algorithms, the language and framework I'm using, of CPU and memory constraints, all come together to let me write code that is damn near magical.
2. Super high level, where I am architecting a solution using design patterns and the individual pieces of code are functionally very simple, and it is how they are connected together that really matters.
For #1, eh, for some popular problems AI can help (popular optimizations on Stack Overflow).
For #2, AI is the most useful, because I have already broken the problem down into individual bite size testable nuggets. I can have the AI write a lot of the boilerplate, and then integrate the code within the larger, human architected, system.
> So the thought of opening up a codebase that was cobbled together by an AI is just scary to me.
The AI didn't cobble together the system. The AI did stuff like "go through this array and check the ID field of each object and if more than 3 of them are null log an error, increment the ExcessNullsEncountered metric counter, and return an HTTP 400 error to the caller"
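The output of an instruction like that is easy to review precisely because it's so mechanical. A minimal sketch of what I'd expect back (the logger and metrics objects are placeholders for whatever your stack provides):

    def validate_ids(items, logger, metrics):
        """Return None when the batch is fine; otherwise signal an HTTP 400."""
        null_ids = sum(1 for obj in items if obj.get("id") is None)
        if null_ids > 3:  # "more than 3 of them are null"
            logger.error("too many objects with a null ID: %d", null_ids)
            metrics.increment("ExcessNullsEncountered")  # placeholder metrics client
            return 400    # the caller maps this to an HTTP 400 response
        return None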
Edit: This just happened
I am writing a small Canvas game renderer, and I am having an issue where text above a character's head renders off the canvas. So I had Cursor fix the function up to move the text under a character if it would have been rendered above the canvas area.
I was able to write the instructions out to Cursor faster than I could have found a pencil and paper to sketch out what I needed to do.
You. Can. Write. Tests.
How do tests account for cases where I'm looking at a 100 line function that could have easily been written in 20 lines with just as much, if not more, clarity?
It reminds me of a time (long ago) when the trend/fad was building applications visually. You would drag and drop UI elements and define logic using GUIs. Behind the scenes the IDE would generate code that linked everything together. One of the selling points was that underneath the hood it's just code so if someone didn't have access to the IDE (or whatever) then they could just open the source and make edits themselves.
It obviously didn't work out. But not because of the scope/scale (something AI code generation solves) but because, it turns out, writing maintainable secure software takes a lot of careful thought.
I'm not talking about asking an AI to vomit out a CRUD UI. For that I'm sure it's well suited and the risk is pretty low. But as soon as you introduce domain specific logic or non-trivial things connected to the real world - it requires thought. Often times you need to spend more time thinking about the problem than writing the code.
I just don't see how "guidance" of an LLM gets anywhere near writing good software outside of trivial stuff.
> How do tests account for cases where I'm looking at a 100 line function that could have easily been written in 20 lines with just as much, if not more, clarity?
That's not a failure of the AI for writing that 100-line monstrosity; it's a failure of you for deciding to actually use it.
If you know what 20 lines are necessary and the AI doesn’t output that, why would you use it?
> How do tests account for cases where I'm looking at a 100 line function that could have easily been written in 20 lines with just as much, if not more, clarity?
If the function is fast to evaluate and you have thorough coverage by tests, you could iterate with an LLM that aims to compress it down to a simpler / shorter version that behaves identically to the original function. Of course, brevity for the sake of brevity can lead to less code that is not always clearer or simpler to understand than the original. LLMs are very good at mimicking code style, so show them a lot of your own code and ask them to mimic it, and you may be surprised.
Finally found a comment down here that I like. I'm also with the notion of tests and also iterating until you get to a solution you like. I also don't see anything particularly "terrifying" that many other comments suggest.
At the end of the day, we're engineers that write complex symbols on a 2d canvas, for something that is (ultimately, even if the code being written is machine to machine or something) used for some human purpose.
Now, if those complex symbols are readable, fully covered in tests, and meets requirements / specifications, I don't see why I should care if a human, an AI, or a monkey generated those symbols. If it meets the spec, it meets the spec.
Seems like most people in these threads are making arguments against others who are describing usage of these tools in a grossly incorrect manner from the get go.
I've said it before in other AI threads: I think (at least half?) of the noise and disagreement around AI-generated code is like a bunch of people trying to use a hammer when they needed a screwdriver and then complaining that the hammer didn't work like a screwdriver!!! I just don't get it. When you're dealing with complex systems, i.e., reality, these tools (or any tool for that matter) will never work like a magic wand.
> a bunch of people trying to use a hammer when they needed a screwdriver and then complaining that the hammer didnt work like a screwdriver
When it's being sold as a screwdriver, that's hardly their fault.
How do you write a test for code clarity / readability / maintainability?
Tests aren't a full solution for all the considerations of the above post.
Tests haven’t saved us so far, humans have been writing tests that passed for software with bugs for decades.
More importantly, you can read diffs.
Depending on whether I'm using LLMs from my Emacs or via a tool like Aider, I either review and manually merge offered modifications as diffs (in editor), or review the automatically generated commits (Aider). Either way, I end up reading a lot of diffs and massaging the LLM output on the fly, and nothing that I haven't reviewed gets pushed to upstream.
I mean, people aren't seriously pushing unreviewed LLM-generated code to production? Current models aren't good enough for that.
The most common failure of TDD is that assuming just bolting on more tests will fix the problem of a poorly designed codebase.
Just let the LLM do that too.
Even better you can let the AI write tests.
The most likely explanation is that the code you are writing has low information density and is stringing things together the same way many existing apps have already done.
That isn’t a judgement but trying to use the ai code completion tools for complex systems tasks is almost always a disaster.
I'm not sure how many people are like me, but my attempts to use Copilot have largely been in the context of writing code as usual, occasionally getting end-of-line or handful-of-lines completions from it. I suspect there's probably a bigger shift needed, but I haven't seen anyone (besides AI "influencers" I don't trust..?) showing what their day-to-day workflows look like.
Is there a Vimcasts equivalent for learning the AI editor tips and tricks?
Have you tried the chat mode?
The autocomplete is somewhere between annoying and underwhelming for me, but the chat is super useful. Being able to just describe what you're thinking or what you're trying to do and having a bespoke code sample just show up (based on the code in your editor) that you can then either copy/paste in, cherry-pick from or just get inspired by, has been a great productivity booster..
Treat it like a pair programmer or a rubber duck and you might have a better experience. I did!
Yeah using a chat interface
> I'm actually very curious why AI use is such a bi-modal experience
I think it's just that it's better at some things than others. Lucky for people who happen to be working in python/node/php/bash/sql/java; probably unlucky for people writing Go and Rust. (I'm hypothesising, because I don't know Go or Rust, nor have I ever used them, but when the AI doesn't know something it REALLY doesn't know it - it goes from being insanely useful to utterly useless.)
> I use AI autocomplete 0% of the time as I found that workflow was not as effective as me just writing code, but most of my most successful work using AI is a chat dialogue where I'm letting it build large swaths of the project a file or parts of a file at a time, with me reviewing and coaching.
Me too, the way I use it is more like pair programming.
> I'm perfectly fine telling the tool I use its errors and working side by side with it like it was another person.
This is key. Traditional computing systems are deterministic machines, but AI is a probabilistic machine. So the way you interact and the range, precision, and perspective of the output stretches over a different problem/solution space.
Interesting that you find the conversational approach effective. For me, I'd say 9 out of 10 code conversations get stuck in a loop, with me telling the AI the next suggested iteration didn't actually change anything or changed it back to something that was already broken. Do you not experience that so often, or do you have a way to escape it?
I encounter that issue when the chat becomes too long.
Starting a new chat with context and asking your question again typically works for me.
I guess for me it actually takes longer to review code than to write it. So maybe that’s some of the difference.
If you're doing something that appears in its training data a lot, like building a Twitter clone, then it is great. If you're using something brand new like React Router 7, then it makes mistakes.
My theory is grammatical correctness and specificity. I see a lot of people prompt like this:
"use python to write me a prog that does some dice rolls and makes a graph"
Vs
"Create a Python program that generates random numbers to simulate a series of dice rolls. Export a graph of the results in PNG format."
Information theory requires that you provide enough actual information. There is a minimum amount of work to supply the input. Otherwise, the gaps will get filled in with noise, which may or may not work, and may or may not be what you want.
For example, maybe someday you could say "write me an OS" and it would work. However, to get exactly what you want, you still have to specify it. You can only compress so far.
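For comparison, the more specific prompt above pins the program down to something like this minimal sketch (matplotlib and the output file name are my assumptions; the prompt didn't specify them):

    import random
    from collections import Counter

    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    ROLLS = 1000
    results = [random.randint(1, 6) for _ in range(ROLLS)]
    counts = Counter(results)

    plt.bar(sorted(counts), [counts[face] for face in sorted(counts)])
    plt.xlabel("Die face")
    plt.ylabel("Frequency")
    plt.title(f"{ROLLS} simulated dice rolls")
    plt.savefig("dice_rolls.png")  # "Export a graph of the results in PNG format"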
> move multi thousand line codebases between languages
I am more curious about why someone would do this.
I agree. I am in a very senior role and find that, working with AI the same way you do, I am many times more productive. Months of work become days or even hours.
> I'm actually very curious why AI use is such a bi-modal experience.
My conspiracy theory is that the positive experiences are exaggerated and come from people invested in Nvidia stock.
Have you tried using chatgpt/etc as a starting point when you're unfamiliar with something? That's where it really excels for me, I can go crazy fast from 0 to ~30 (if we call 60 mvp). For example, the other day I was trying to stream some pcm audio using webaudio and it spit out a mostly functional prototype for me in a few minutes of trying. For me to read through msdn and get to that point would've taken an hour or two, and going from the crappy prototype as a starting point to read up on webaudio let me get an mvp in ~15 mins. I rarely touch frontend web code so for me these tools are super helpful.
On the other hand, I find it just wastes my time on more typical tasks like implementing business logic in a familiar language, because it makes up stdlib APIs too often.
This is about the only use case I found it helpful for - saving me time in research, not in coding.
I needed to compare compression ratios of a certain text in a language, and it actually came up with something nice and almost workable. It didn't compile, but I forget why now; I just remember it needed a small tweak. That saved me having to track down the libraries, their APIs, etc.
However, when it comes to actually doing data structures or logic, I find it quicker to just do it myself than to type out what I want to do, and double check its work.
That's a very important caveat. In our modern economy it's difficult to not be a shill in some way, shape, or form, even if you don't quite realize it consciously. It's honestly one of the most depressing things about the stock market.
There's a big difference between being a happy customer and being a shill.
Holding stock is not being a "happy customer". I may be happy with the headset that I bought, but the difference is that I don't make money if you buy an identical one.
I wasn't talking about holding stock, I was responding to this comment you said:
> In our modern economy it's difficult to not be a shill in some way, shape, or form, even if you don't quite realize it consciously.
Oxford dictionary defines a shill as "an accomplice of a confidence trickster or swindler who poses as a genuine customer to entice or encourage others."
So the difference between someone shilling and being a satisfied customer is an intent to deceive. How is it "difficult to not pose as a genuine customer to entice or encourage others"?
I credit my past interest in cryptocurrencies for educating me about the essence of the stock market in its purest form. And in fact there are painful parallels with the AI bubble.
It's the subtle errors that are really difficult to navigate. I got burned for about 40 hours on a conditional being backward in the middle of an otherwise flawless method.
The apparent speed up is mostly a deception. It definitely helps with rough outlines and approaches. But, the faster you go, the less you will notice the fine details, and the more assumptions you will accumulate before realizing the fundamental error.
I'd rather find out I was wrong within the same day. I'd probably have written some unit tests and played around with that function a lot more if I had handcrafted it.
>> The apparent speed up is mostly a deception.
When I am able to ask an LLM a very simple question and avoid having to context-switch to answer that same simple question myself, it is a big time saver for me, even if hard to quantify.
Anything that reduces my cognitive load when the pressure is on is a blessing on some level.
That might be the measurable, genuinely non-deceptive part of the time saving, whereas most of it is still deceptive in terms of time saved.
You could make the same argument for any non-AI driven productivity tool/technique. If we can't trust the user to determine what is and is not time-saving then time-saving isn't a useful thing to discuss outside of an academic setting.
My issue with most AI discussions is they seem to completely change the dimensions we use to evaluate basic things. I believe if we replaced "AI" with "new useful tool" then people would be much more eager to adopt it.
What clicked for me is when I started treating it more like a tool and less like some sort of nebulous pandora's box.
Now to me it's no different than auto completing code, fuzzy finding files, regular expressions, garbage collection, unit testing, UI frameworks, design patterns, etc. It's just a tool. It has weaknesses and it has strengths. Use it for the strengths and account for the weaknesses.
Like any tool it can be destructive in the hands of an inexperienced person or a person who's asking it to do too much. But in the hands of someone who knows what they're doing and knows what they want out of it - it's so freakin' awesome.
Sorry for the digression. All that to say that if someone believes it's a productivity boost for them then I don't think they're being misled.
Except actual studies objectively show efficiency gains, more so with junior devs, which makes sense. So no, it's not a "deception", but it is often overstated in popular media.
Studies have limitations, in particular they test artificial and narrowly-scoped problems that are quite different from real world work.
And anecdotes are useless. If you want to show me improved studies justifying your claim great, but no I don't value random anecdotes. There are countless conflicting anecdotes (including my own).
I find the opposite: the more senior you are, the more value they offer, as you know how to ask the right questions, how to vary the questions and try different tacks, and how to spot errors or mistakes.
Cognitive load is something people always leave out. I can fuckin code drunk with these things. Or just increase stamina to push farther than I would writing every single line.
Exactly, 1 step forward, 1 step backward. Avoiding edge cases is something that can't be glossed over, and for that I need to carefully review the code. Since I'm accountable for it, and can't skip this part anyway, I'd rather review my own than some chatbot's.
That’s the thing, isn’t it? The craft of programming in the small is one of being intimate with the details, thinking things through conscientiously. LLMs don’t do that.
I find that it depends very heavily on what you're up to. When I ask it to write Nix code it'll just flat out forget how the syntax works halfway through. But if I want it to troubleshoot an emacs config or wield matplotlib it's downright wizardly, often including the kind of thing that does indicate an intimacy with the details. I get distracted because I'm then asking it:
> I un-did your change which made no sense to me and now everything is broken, why is what you did necessary?
I think we just have to ask ourselves what we want it to be good at, and then be diligent about generating decades worth of high quality training material in that domain. At some point, it'll start getting the details right.
That doesn't work in the tech industry, because almost nothing is decades old, for obvious reasons.
What languages/toolkits are you working with that are less than 10 years old?
Anyhow, it seems to me like it is working. It's just working better for the really old stuff because:
- there has been more time for training data to accumulate
- some of it predates the trend of monetizing data, so there was less hoarding and more sharing
- we used to have mailing lists and now we have discourse/slack. The former makes a better training dataset
It may be that the hard slow way is the only way to get good results. If the modern trends re: products don't have the longevity/community to benefit from it, maybe we should fix that.
Perhaps it should be prompted to then?
Ask it to review its own code for any problems?
Also identify typical and corner cases and generate tests?
Question marks here because I have not used the tool.
The size & depth of each accepted code step is still up to the developer slash prompter
I use Chatgpt for coding / API questions pretty frequently. It's bad at writing code with any kind of non-trivial design complexity.
There have been a bunch of times where I've asked it to write me a snippet of code, and it cheerfully gave me back something that doesn't work for one reason or another. Hallucinated methods are common. Then I ask it to check its code, and it'll find the error and give me back code with a different error. I'll repeat the process a few times before it eventually gets back to code that resembles its first attempt. Then I'll give up and write it myself.
As an example of a task that it failed to do: I asked it to write me an example Python function that runs a subprocess, prints its stdout transparently (so that I can use it for running interactive applications), but also records the process's stdout so that I can use it later. I wanted something that used non-blocking I/O methods, so that I didn't have to explicitly poll every N milliseconds or something.
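For what it's worth, here's a minimal sketch of one way I'd expect that to look using asyncio, so there's no manual polling loop. The command is just an example, and it doesn't allocate a pty, so fully interactive programs may still buffer their output:

    # Sketch: run a subprocess, echo its stdout live, and keep a copy of it.
    import asyncio
    import sys

    async def run_and_capture(*cmd):
        proc = await asyncio.create_subprocess_exec(
            *cmd, stdout=asyncio.subprocess.PIPE
        )
        captured = bytearray()
        while True:
            chunk = await proc.stdout.read(1024)  # waits for data, no busy-polling
            if not chunk:
                break
            sys.stdout.buffer.write(chunk)  # print transparently
            sys.stdout.buffer.flush()
            captured.extend(chunk)          # record stdout for later use
        await proc.wait()
        return bytes(captured)

    output = asyncio.run(run_and_capture("ls", "-l"))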
Honestly I find that when GPT starts to lose the plot it's a good time to refactor and then keep on moving. "Break this into separate headers or modules and give me some YAML like markup with function names, return type, etc for each file." Or just use stubs instead of dumping every line of code in.
That's presumably what o1-preview does? Iterates and checks the result. It takes much longer, but does indeed write slightly better code.
How long are you willing to iterate to get things right?
If it takes almost no cognitive energy, quite a while. Even if it's a little slower than what I can do, I don't care because I didn't have to focus deeply on it and have plenty of energy left to keep on pushing.
I'm constantly having to go back and tell the AI about every mistake it makes and remind it not to reintroduce mistakes that were previously fixed. "no cognitive energy" is definitely not how I would describe that experience.
As my mother used to say, "I love work. I could watch it all day!"
I can see where you are coming from.
Maintaining a better creative + technical balance, instead of see-sawing. More continuous conscious planning, less drilling.
Plus the unwavering tireless help of these AI's seems psychologically conducive to maintaining one's own motivation. Even if I end up designing an elaborate garden estate or a simpler better six-axis camera stabilizer/tracker, or refactoring how I think of primes before attempting a theorem, ... when that was not my agenda for the day. Or any day.
I would love to find out where programmers are learning this idea: that writing code fast is ideal.
If it takes 30 years to write one loc, it takes 30 years.
Ideally, it takes 30 years to write zero lines of code.
The best programmers are no programmers!
Alternatively,
more logic does not validate the truth.
Why aren't you writing unit tests just because AI wrote the function? Unit tests should be written regardless of the skill of the developer. Ironically, unit tests are also one area where AI really does help move faster.
High level design, rough outlines and approaches, is the worst place to use AI. The other place AI is pretty good is surfacing api call or function calls you might not know about if you're new to the language. Basically, it can save you a lot of time by avoiding the need for tons of internet searching in some cases.
I have completely the opposite perspective.
Unit tests actually need to be correct, down to individual characters. Same goes with API calls. The API needs to actually exist.
Contrast that with "high level design, rough outlines". Those can be quite vague and hand-wavy. That's where these fuzzy LLMs shine.
That said, these LLM-based systems are great at writing "change detection" unit tests that offer ~zero value (or negative).
The fact that you think "change detection" tests offer zero value speaks volumes. Those may well be the most important use of unit tests. Getting the function correct in the first place isn't that hard for a senior developer, which is often why it's tempting to skip unit tests. But then you go refactor something and oops you broke it without realizing it, some boring obvious edge case, or the like.
These tests are also very time consuming to write, with lots of boilerplate that AI is very good at writing.
https://testing.googleblog.com/2015/01/testing-on-toilet-cha...
"speaks volumes" lol
I think you've misunderstood what he meant by change detection (not GP, could be wrong).
Hard to describe, easy to spot.
Some people write tests that are tightly coupled to their particular implementation.
They might have tons of setup code in each test. So refactoring means each test needs extensive rewrites.
Or there will be loads of asserts that have little to do with the actual thing being tested.
These tests usually have negative value as your only real option as another developer is to simply delete them all and start again.
That's what I would interpret the GP as meaning when they use the phrase "change detection" tests.
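A contrived illustration of the difference, with a made-up load_user() function (not from any real codebase):

    import unittest
    from unittest import mock

    # Hypothetical code under test.
    def load_user(db, user_id):
        row = db.fetch_one("SELECT * FROM users WHERE id = ?", user_id)
        return {"id": row["id"], "name": row["name"].title()}

    class ChangeDetectionStyle(unittest.TestCase):
        # Tightly coupled to the implementation: it pins the exact SQL string and
        # call shape, so renaming a column or switching to an ORM breaks the test
        # even when the behaviour is unchanged.
        def test_calls_fetch_one_with_exact_sql(self):
            db = mock.Mock()
            db.fetch_one.return_value = {"id": 1, "name": "ada"}
            load_user(db, 1)
            db.fetch_one.assert_called_once_with(
                "SELECT * FROM users WHERE id = ?", 1
            )

    class BehaviourStyle(unittest.TestCase):
        # Tests what callers actually care about: the returned value.
        def test_returns_title_cased_name(self):
            db = mock.Mock()
            db.fetch_one.return_value = {"id": 1, "name": "ada"}
            self.assertEqual(load_user(db, 1), {"id": 1, "name": "Ada"})

    if __name__ == "__main__":
        unittest.main()

The first style is the kind of test you usually end up deleting and rewriting during a refactor; the second survives it.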
> That said, these LLM-based systems are great at writing "change detection" unit tests that offer ~zero value (or negative).
That’s not at all true in my experience. With minimal guidance they put out pretty sensible tests.
> With minimal guidance[, LLM-based systems] put out pretty sensible tests.
Yes and no. They get out all the initial annoying boilerplate of writing tests out of the way, and the tests end up being mostly decent on the surface, but I have to manually tweak the behavior and write most of the important parts myself, especially for non-trivial tricky scenarios.
However, I am not saying this as a point against LLMs. The fact that they are able to get a good chunk of the boring boilerplate parts of writing unit tests out of the way and let me focus on the actual logic of individual tests has been noticeably helpful to me, personally.
I only use LLMs for the very first initial phase of writing unit tests, with most of the work still being done by me. But that initial phase is the most annoying and boring part of the process for me. So even if I still spend 90% of the time writing code manually, I still am very glad for being able to get that initial boring part out of the way quickly, without wasting my mental effort cycles on it.
For now, I mostly use AI as a "faster typist".
If it wants to complete what I wanted to type anyway, or something extremely similar, I just press tab, otherwise I type my own code.
I'd say about 70% of individual lines are obvious enough if you have the surrounding context that this works pretty well in practice. This number is somewhat lower in normal code and higher in unit tests.
Another use case is writing one-off scripts that aren't connected to any codebase in particular. If you're doing a lot of work with data, this comes in very handy.
Something like "here's the header of a CSV file", pass each row through model x, only pass these three fields, the model will give you annotations, put these back in the csv and save, show progress, save every n rows in case of crashes, when the output file exists, skip already processed rows."
I'm not (yet) convinced by AI writing entire features, I tried that a few times and it was very inconsistent with the surrounding codebase. Managing which parts of the codebase to put in its context is definitely an art though.
It's worth keeping in mind that this is the worst AI we'll ever have, so this will probably get better soon.
Reminds me of how I use the satnav when driving.
I don't close my eyes and do whatever it tells me to do. If I think I know better, I don't "turn right at the next set of lights"; I just drive on as I would have before GPS, and eventually either I realise I went the wrong way or the satnav realises there was a perfectly valid 2nd/3rd/4th path to get to where I wanted to go.
One off scripts do work very well.
I find chatgpt incredibly useful for writing scripts against well-known APIs, or for a "better stackoverflow". Things like "how do I use a cursor in sql" or "in a devops yaml pipeline, I want to trigger another pipeline. How do I do that?".
But working on our actual codebase with copilot in the IDE (Rider, in my case) is a net negative. It usually does OK when it's suggesting the completion of a single line, but when it decides to generate a whole block it invariably misunderstands the point of the code. I could imagine that getting better if I wrote more descriptive method names or comments, but the killer for me is that it just makes up methods and method signatures, even for objects that are part of publicly documented frameworks/APIs.
I love your framing of it as a "better stackoverflow." That's so true. However, I feel like some of our complaints about accuracy and hidden bugs are temporary pain (12-36 months) before the tools truly become mind-blowing productivity multipliers.
Same here. If you need to look up how to do something in an API, I find it much faster to use ChatGPT than to try to search through the janky official docs or some GitHub examples folder. ChatGPT is basically documentation search 2.0.
I haven't used Cursor, but I use Aider with Sonnet 3.5 and also use Copilot for "autocomplete".
I'd highly recommend reading through Aider's docs[0], because I think they're relevant for any AI tool you use. A lot of people harp on prompting, and while a good prompt is important, I often see developers making other mistakes, like providing context that isn't good or correct, or providing too much of it[1].
When I find models are going on the wrong path with something, or "connecting the pipes wrong", I often add code comments that provide additional clarity. Not only does this help future me/devs, but the more I steer AI towards correct results, the fewer problems models seem to have going forward.
Everybody seems to be having wildly different experiences using AI for coding assistance, but I've personally found it to be a big productivity boost.
[0] https://aider.chat/docs/usage/tips.html
[1] https://aider.chat/docs/troubleshooting/edit-errors.html#red...
Totally agree that heavy commenting is the best convention for helping the assistant help you best. I try to comment in a way that makes a file or function into a "story" or kind of a single narrative.
That's super interesting, I've been removing a lot of the redundant comments from the AI results. But adding new more explanatory ones that make it easier for both AI and humans to understand the code base makes a lot of sense in my head.
I was big on writing code to be easy to read for humans, but it being easy to read for AI hasn't been a large concern of mine.
In general I do not find AI a net positive. Other tools seem to do at least as well in general.
it can be used if you want the reliability of a random forum poster. which... sure. knock yourself out. sometimes there's gems in that dirt.
I'm getting _very_ bearish on using LLMs for things that aren't pattern recognition.
Time will tell. As a GitHub Copilot user, I still review the code.
SpaceX's advancements are impressive, from rockets blowing up to successfully catching the Starship booster.
Who knows what AI will be capable of in 5-10 years? Perhaps it will revolutionize code assistance or even replace developers.
> SpaceX's advancements are impressive, from rockets blowing up to successfully catching the Starship booster.
That felt like it was LLM generated, since it doesn't have anything to do with the subject being discussed. Not only is it in a different industry, but it's a completely different set of problems. We know what's involved in catching a rocket. It's a massive engineering challenge, yes, but we all know it can be done (whether or not it makes sense or is economically viable are different issues).
Even going to the Moon – which was a massive project and took massive focus from an entire country to do – was a matter of developing the equipment, procedures, calculations (and yes, some software). We knew back then it could be done, and roughly how.
Artificial intelligence? We don't know enough about "intelligence". There isn't even a target to reach right now. If we said "resources aren't a problem, let's build AI", there isn't a single person on this planet that can tell you how to build such an AI or even which technologies need to be developed.
More to the point, current LLMs are able to probabilistically generate data based on prompts. That's pretty much it. They don't "know" anything about what they are generating, they can't reason about it. In order for "AI" to replace developers entirely, we need other big advancements in the field, which may or may not come.
> Artificial intelligence? We don't know enough about "intelligence".
The problem I have with this objection is that it, like many discussions, conflates LLMs (glorified predictive text) and other technologies currently being referred to as AI, with AGI.
Most of these technologies should still be called machine learning as they aren't really doing anything intelligent in the sense of general intelligence. As you say yourself: they don't know anything. And by inference, they aren't reasoning about anything.
Boilerplate code for common problems, and some not so common ones, which is what LLMs are getting pretty OK at and might in the coming years be very good at, is a definable problem that we understand quite well. And much as we like to think of ourselves as "computer scientists", the vast majority of what we do boils down to boilerplate code using common primitives, that are remarkably similar across many problem domains that might on first look appear to be quite different, because many of the same primitives and compound structures are used. The bits that require actual intelligence are often quite small (this is how I survive as a dev!), or are away from the development coalface (for instance: discovering and defining the problems before we can solve them, or describing the problem & solution such that someone or an "AI" can do the legwork).
> we need other big advancements in the field, which may or may not come.
I'm waiting for an LLM being guided to create a better LLM, and eventually down that chain a real AGI popping into existence, much like the infinite improbability drive being created by clever use of a late version finite improbability generator. This is (hopefully) many years (in fact I'm hoping for at least a couple of decades so I can be safely retired or nearly there!) from happening, but it feels like such things are just over the next deep valley of disillusionment.
Except Cursor is the fireworks based on black powder here. It will look good, but as a technology to get you to the Moon it looks like a dead end. NOTHING (of serious science) seems to indicate LLMs being anything but a dead end with current hardware capabilities.
So then I ask: What, in qualitative terms, makes you think AI in the current form will be capable of this in 5 or 10 years? Other than seeing the middle of what seems to be an S-curve and going «ooooh shiny exponential!»
> NOTHING (of serious science) seems to indicate LLMs being anything but a dead end with current hardware capabilities.
In the same sense that black powder sucks as a rocket propellant - but it's enough to demonstrate that iterating on the same architecture and using better fuels will get you to the Moon eventually. LLMs of today are starting points, and many ideas for architectural improvements are being explored, and nothing in serious science suggests that will be a dead end any time soon.
It’s easy to say with hindsight but if all you have is black powder I don’t think it’s obvious those better fuels even exist.
If you look at LLM performance on benchmarks, they keep getting better at a fast rate.[1]
We also now have models of various sizes trained in general matters, and those can now be tuned or fine-tuned to specific domains. The advances in multi-modal AI are also happening very quickly as well. Model specialization, model reflection (chain of thought, OpenAI's new O1 model, etc.) are also undergoing rapid experimentation.
Two demonstrable things that LLMs don't do well currently, are (1) generalize quickly to out-of-distribution examples, (2) catch logic mistakes in questions that look very similar to training data, but are modified. This video talks about both of these things.[2]
I think I-JEPA is a pretty interesting line of work towards solving these problems. I also think that multi-modal AI pushes in a similar direction. We need AI to learn abstractions that are more decoupled from the source format, and we need AI that can reflect and modify its plans and update itself in real time.
All these lines of research and development are more-or-less underway. I think 5-10 years is reasonable for another big advancement in AI capability. We've shown that applying data at scale to simple models works, and now we can experiment with other representations of that data (ie other models or ways to combine LLM inferences).
[1]: https://www.anthropic.com/news/3-5-models-and-computer-use
[2]: https://www.youtube.com/watch?v=s7_NlkBwdj8
> or even replace developers
I don't think there will be a 'replace developers, but other work remains extant' moment - at least not for very long at all.
I use it for an unfamiliar programming language and it's very nice. You can also ask it to explain badly documented code.
I think there's a number of factors that make it work as well as it does for me:
- Mostly writing React
- Not using any obscure or new libraries
- Naming things well
- Keeping logic simple
- Leaving a comment at the point where I'm about to make a shift from what the common logic would be
- Getting a feel for when it's going to be able to correctly guess or not (and not even reading it if I think it's going to be wrong)
- Trusting short blocks more than long ones
Tab complete is just one of their proprietary models. I find chat-mode more helpful for refactoring and multi-file updates, even more when I specify the exact files to include.
I'd love an autoselected LLM that is fine-tuned to the syntax I'm actively using -- Cursor has a bit of a head start, but where Github and others can take it could be mindblowing (Cursor's moat is a decent VS Code extension -- I'm not sure it's a deep moat though).
With React, the guesses it makes around what props / types I want where, especially moving from file to file, is worth the price of admission. Everything else it does it icing on the cake. The new import suggestion is much quicker than the Typescript compiler lmao. And it's always the right one, instead of suggesting ones that aren't relevant.
Composer can be hit or miss, but I've found it really good at game programming.
One of the reasons for that may be the price: large code changes with multi turn conversation can eat up a lot of tokens, while those tools charge you a flat price per month. Probably many hacks are done under the hood to keep *their* costs low, and the user experiences this as lower quality responses.
Still the "architecture and core libraries" is rather corner case, something at the bottom of their current sales funnel.
also: do you really want to get equivalent of 1 FTE work for 20 USD per month?:)
> in practice I’m not noticing a productivity boost
I am. Can suddenly do in a weekend what would have taken a week.
> in practice I’m not noticing a productivity boost.
How can this be possible if you literally admit its tab completion is mindblowing?
Isn't really good tab completion good enough for at least a 5% productivity boost? 10%? 20%?
Select line of code, prompt it to refactor, verify they are good, accept the changes
> How can this be possible if you literally admit its tab completion is mindblowing?
I might suggest that coding doesn't take as much of our time as we might think it does.
Hypothetically:
Suppose coding takes 20% of your total clock time. If you improve your coding efficiency by 10%, you've only improved your total job efficiency by 2%. This is great, but probably not the mind-blowing gain that's hyped by the AI boom.
(I used 20% as a sample here, but it's not far away from my anecdotal experience, where so much of my time is spent in spec gathering, communication, meeting security/compliance standards, etc).
> that's hyped by the AI boom
I always personally thought the AI boom was more about 'what can be done automation/improvement wise for non-tech industries'
> How can this be possible if you literally admit its tab completion is mindblowing?
What about it makes it impossible? I’m impressed by what AI assistants can do - and in practice it doesn’t help me personally.
> Select line of code, prompt it to refactor, verify they are good, accept the changes.
It’s the “verify” part that I find tricky. Do it too fast and you spend more time debugging than you originally gained. Do it too slow and you don’t gain much time.
There is a whole category of bugs that I’m unlikely to write myself but I’m likely to overlook when reading code. Mixing up variable types, mixing up variables with similar names, misusing functions I’m unfamiliar with and more.
How does AI learn from its mistakes? Genuine question, as I have only briefly used ChatGPT and found it interesting but not useful.
I think the essential point around impressive vs helpful sums up so much of the discourse around this stuff. It's all just where you fall on the line between "impressive is necessarily good" and "no it isn't".
If someone can eat 20 golf balls that’s impressive but it doesn’t improve my golf game
That is actually very impressive.
> How can this be possible if you literally admit its tab completion is mindblowing?
If I had a knife of perfect sharpness which never dulled, that would be mind-blowing. It also would very likely not make me a better cook.
In my experience, I always have to spend time checking, and most of the time it doesn't do what I need unless it's a very simple ask.
I rarely use the tab completion. Instead I use the chat and manually select files I know should be in context. I am barely writing any code myself anymore.
Just sanity checking that the output and “piping” is correct.
My productivity (in frontend work at least) is significantly higher than before.
Out of curiosity, how long have you been working as a developer? Just that, in my experience, this is mostly true for juniors and mids (depending on the company, language, product etc. etc.). For example, I often find that copilot will hallucinate tailwind classes that don't exist in our design system library, or make simple logical errors when building charts (sometimes incorrect ranges, rarely hallucinated fields) and as soon as I start bringing in 3rd party services or poorly named legacy APIs all hope is lost and I'm better off going it alone with an LSP and a prayer.
AI will have the effect of shifting development effort from authorship to verification. As you note, we've come a long way towards making the writing of the code practically free, but we're going to need to beef up our tools for understanding code already written. I think we've only scratched the surface of AI-assisted program analysis.
Honestly, I hate it. I find myself banging my head on the table because of how many endless cycles it goes through without giving me the solution, haha. I've probably started projects over multiple times because the code it generated was so bad, lol.
That's my exact experience with GitHub Copilot. It even sucks at boilerplate stuff. I have no idea why its autocomplete is so bad when it has access to my code, the function signatures, types, etc. It gets stuff wrong all the time. For example, it will just flat out suggest functions that don't exist, neither in the Python core libraries nor in my own modules. It doesn't make sense.
I have all but given up on using Copilot for code development. I still do use it for autocomplete and boilerplate stuff, but I still have to review that. So there's still quite a bit of overhead, as it introduces subtle errors, especially in languages like Python. Beyond that, its failure rate at producing running, correct code is basically 100%.
I'm building a tool in this space and believe it's actually multiple separate problems. From most to least solvable:
1. AI coding tools benefit a lot from explicit instructions/specifications and context for how their output will be used. This is actually a very similar problem to when eg someone asks a programmer "build me a website to do X" and then being unhappy with the result because they actually wanted to do "something like X", and a payments portal, and yellow buttons, and to host it on their existing website. So models need to be given those particular instructions somehow (there are many ways to do it, I think my approach is one of the best so far) and context (eg RAG via find-references, other files in your codebase, etc)
2. AI makes coding errors, bad assumptions, and mistakes just like humans. It's rather difficult to implement auto-correction in a good way, and goes beyond mere code-writing into "agentic" territory. This is also what I'm working on.
3. AI tools don't have architecture/software/system design knowledge appropriately represented in their training data and in all the other techniques used to refine the model before releasing it. More accurately, they might have knowledge in the form of e.g. all the blog posts and docs out there about it, but not skill. Actually, there is some improvement here, because I think o1 and 3.5 Sonnet are doing some kind of reinforcement learning/self-training to get better at this. But it's not easily addressable on your end.
4. There is ultimately a ton of context cached in your brain that you cannot realistically share with the AI model, either because it's not written anywhere or there is just too much of it. For example, you may want to structure your code in a certain way because your next feature will extend it or use it. Or your product is hosted on serving platform Y which has an implementation detail where it tries automatically setting Content-Type response headers by appending them to existing headers, so manually setting Content-Type in the response causes bugs on certain clients. You can't magically stuff all of this into the model context.
My product tries to address all of these to varying extents. The largest gains in coding come from making it easier to specify requirements and self-correct, but architecture/design are much harder and not something we're working on much. You or anybody else can feel free to email me if you're interested in meeting for a product demo/feedback session - so far people really like our approach to setting output specs.
Every single one of these discussion, at some point, devolves to some version of
- <LLM Y> is by far the best. In my extensive usage it is consistently outperforms <LLM X> by at least 2x. The difference is night and day.
Then the immediate child reply:
- What!? You must be holding it wrong. The complete inverse is true for me.
I don't know what to make of this contradiction. We're all using the same 2 things right? How can opinions vary by such a large amount. It makes me not trust any opinion on any other subject (which admittedly is not a bad default state, but who has time to form their own opinions on everything).
Lots of possibilities here.
People are learning to prompt LLMs in ways that produce better results for their LLM of choice, so switching to another one they find their approach no longer works as well.
Or.. LLMs have different personalities in terms of output; some being more or less direct/polite than others, or sounding more or less confident; and that is causing people to perceive a difference that in terms of factual answers may not be different.
Or just personal preference masquerading as intelligence - a classic among software engineers.
Reviewing these conversations is like listening to horse and buggy manufacturers pooh-poohing automobiles:
1. They will scare the horses; this funky 'automobile' is no match for a good team of horses.
2. How will they deal with our muddy, messy roads?
3. Their engines are unreliable and prone to breaking down, stranding you in the middle of nowhere and leaving you to fix it yourself.
4. Their drivers can't handle the speed; too many miles driven means unsafe driving. We should stick to horses; they are manageable.
Meanwhile I'm watching a community of mostly young people building and using tools like copilot, cursor, replit, jacob etc and wiring up LLMs into increasingly more complex workflows.
This is a snapshot of the current state, not a reflection of the future. Give it 10 years.
Reading this comment is like listening to Tesla in 2014 tell me about how their cars will be driving themselves. Give it 10 years.
It is hard to understand a phenomenon when it stands to reduce your income. Even if LLMs don't improve one bit from here and the current state is frozen, they are still too good, and they will be everywhere before we can finish talking about horses and automobiles.
LLMs make my job as a software engineer even more secure. Most of what I do is social and/or understanding what is going on. LLMs are a tool to reduce mental load on some tasks when in VSCode. They are like the pilot's autopilot.
If an LLM takes my job, then we have reached the singularity. Jobs won't matter anymore at that point.
Except the automobile in this case only reaches the destination correctly sometimes. They are less likely to reach the destination as the path becomes longer or more complex.
I'm not sure why people can't be humble enough to accept that we don't really know what the future will hold. Just because people have underestimated some new technology in the past doesn't mean that will continue to be true for all new technologies.
The fact that LLMs currently do not really understand the answers they're giving you is a pretty significant limitation they have. It doesn't make them useless, but it means they're not as useful at a lot of tasks that people think they can handle. And that limitation is also fundamental to how LLMs work. Can that be overcome? Maybe. There's certainly a ton of money behind it and a lot of smart people are working on it. But is it guaranteed?
Perhaps I'm wrong and we already know that it's simply a matter of time. I'd love to read an technical explanation for why that is, but I mostly see people rolling their eyes at us mere mortals who don't see how this will obviously change everything as if we're too small minded to understand what's going on.
To be extra clear, I'm not saying LLMs won't be a technological innovation as seismic as the invention of the car. My confusion is why for some there doesn't seem to be room for doubt.
Prospective and retrospective analysis are fundamentally different. It’s easy to point to successes and failures of the past, but that’s not how we predict the concrete future potential of one specific thing.
I don't see a young/old divide when it comes to AI. Although there is a young/old divide in familial responsibilities and willingness to be a chip on the VC's roulette table.
Absolutely true. The oldest devs I work with are some of the most enthusiastic about using LLM chat to develop. Among the younger devs, they all seem to use it but the amount that can actually produce working code are few.
Now I get a lot of calls from the team asking for help fixing some code they got from an AI. Overall it is improving the code quality from the group; I no longer have to instruct people on the basics of setting up their approach/solution. I will admit there is a little difficulty dealing with pushback on my guidance, e.g. “well, ChatGPT said I should use this library” when the core SDK already supports something more recent than the AI was trained on.
I think a lot of the criticism is constructive. Many of the limitations won’t just magically go away - we’ll have to build tooling and processes and adjust our way of thinking to get there. Most devs will jump across to anything useful the second it’s ready, I would think
A reminder that we basically built cities around cars, because they still need fuel, break down, and get stuck in the mud.
What is your similar plan for LLMs?
Analogies always end somewhere, I’m just curious where yours does.
This is pretty exciting. I'm a copilot user at work, but also have access to Claude. I'm more inclined to use Claude for difficult coding problems or to review my work as I've just grown more confident in its abilities over the last several months.
I use both Claude and ChatGPT/GPT-4o a lot. Claude, the model, definitely is 'better' than GPT-4o. But OpenAI provides a much more capable app in ChatGPT and an easier development platform.
I would absolutely choose to use Claude as my model with ChatGPT if that happened (yes, I know it won't). ChatGPT as an app is just so far ahead: code interpreter, web search/fetch, fluid voice interaction, Custom GPTs, image generation, and memory. It isn't close. But Claude absolutely produces better code, only being beaten by ChatGPT because it can fetch data from the web to RAG enhance its knowledge of things like APIs.
Claude's implementation of artifacts is very good though, and I'm sure that is what led OpenAI to push out their buggy canvas feature.
It’s all a dice game with these things, you have to watch them closely or they start running you (with bad outcomes). Disclaimers aside:
Sonnet is better in the small, by a lot. It’s sharply up from idk, three months ago or something when it was still an attractive nuisance. It still tops out at “Best SO Answer”, but it hits that like 90%+. If it involves more than copy paste, sorry folks, it’s still just really fucking good copy paste.
But for sheer “doesn’t stutter every interaction at the worst moment”? You’ve got to hand it to the ops people: 4o can give you second best in industrial quantity on demand. I’m finding that if AI is good enough, then OpenAI is good enough.
>If it involves more than copy paste, sorry folks, it’s still just really fucking good copy paste.
Are you sure you're using Claude 3.5 Sonnet? In my experience it's absolutely capable of writing entire small applications based off a detailed spec I give it, which don't exist on GitHub or Stack Overflow. It makes some mistakes, especially for underspecified things, but generally it can fix them with further prompting.
I’m quite sure what model revision their API quotes, though serious users rapidly discover that like any distributed system, it has a rhythm to it.
And I’m not sure we disagree.
Vercel demo but Pets is copy paste.
We have entered the era of generic fashionable CRUD framework demo Too Cheap To Hawk.
Are there any good 3rd-party native frontend apps for Claude (on MacOS)? I mean something like ChatGPTs app, not an editor. I guess one option would be to just run Claude iPad app on MacOS.
Jan [0] is MacOS native, open source, similar feel to the ChatGPT frontend, very polished, and offers Anthropic integration (all Claude models).
It also features one-click installation, OpenAI integration, a hub for downloading and running local models, a spec-compatible API server, global "quick answer" shortcut, and more. Really can't recommend it enough!
[0] https://github.com/janhq/jan
You can use https://recurse.chat/ if you have an Apple silicon Mac.
Msty [0] is a really good app - you can use both local or online models and has web search, attachments, RAG, split chats, etc., built-in.
[0] https://msty.app
If you're willing to settle for a client-side only web frontend (i.e. talks directly with APIs of the models you use), TypingMind would work. It's paid, but it's good (see [0]), and I guess you could always go for the self-hosted version and wrap it in an Electron app - it's what most "native" apps are these days anyway (and LLM frontends in particular).
--
[0] - https://news.ycombinator.com/item?id=41988306
I like msty.app. Parallel prompting across multiple commercial and local models plus branching dialogs. Doesn’t do artifacts, etc, though.
Open-WebUI doesn't support Claude natively (only through a series of hacks) but it is absolutely "THE" go-to for a ChatGPT Pro like experience (it is slightly better).
https://github.com/open-webui/open-webui
> But OpenAI provides a much more capable app in ChatGPT and an easier development platform
Which app are you talking about here?
FWIW, I was able to get a decent way into making my own client for ChatGPT by asking the free 3.5 version to do JS for me* before it was made redundant by the real app, so this shouldn't be too hard if you want a specific experience/workflow?
* I'm iOS by experience; my main professional JS experience was something like a year before jQuery came out, so I kinda need an LLM to catch me up for anything HTML
Also, I wanted HTML rather than native for this.
> ChatGPT as an app is just so far ahead: code interpreter, web search/fetch, fluid voice interaction, Custom GPTs, image generation, and memory. It isn't close.
Funny thing, TypingMind was ahead of them for over a year, implementing those features on top of the API, without trying to mix business model with engineering[0]. It's only recently that ChatGPT webapp got more polished and streamlined, but TypingMind's been giving you all those features for every LLM that can handle it. So, if you're looking for ChatGPT-level frontend to Anthropic models, this is it.
ChatGPT shines on mobile[1] and I still keep my subscription for that reason. On desktop, I stick to TypingMind and being able to run the same plugins on GPT-4o and Claude 3.5 Sonnet, and if I need a new tool, I can make myself one in five minutes with passing knowledge of JavaScript[2]; no need to subscribe to some Gee Pee Tee.
Now, I know I sound like a shill, I'm not. I'm just a satisfied user with no affiliation to the app or the guy that made it. It's just that TypingMind did the blindingly obvious thing to do with the API and tool support (even before the latter was released), and continues to do the obvious things with it, and I'm completely confused as to why others don't, or why people find "GPTs" novel. They're not. They're a simple idea, wrapped in tons of marketing bullshit that makes it less useful and delayed its release by half a year.
--
[0] - "GPTs", seriously. That's not a feature, that's just system prompt and model config, put in an opaque box and distributed on a marketplace for no good reason.
[1] - Voice story has been better for a while, but that's a matter of integration - OpenAI putting together their own LLM and (unreleased) voice model in a mobile app, in a manner hardly possible with the API their offered, vs. TypingMind being a webapp that uses third party TTS and STT models via "bring your own API key" approach.
[2] - I made https://docs.typingmind.com/plugins/plugins-examples#db32cc6... long before you could do that stuff with ChatGPT app. It's literally as easy as it can possibly be: https://git.sr.ht/~temporal/typingmind-plugins/tree. In particular, this one is more representative - https://git.sr.ht/~temporal/typingmind-plugins/tree/master/i... - PlantUML one is also less than 10 lines of code, but on top of 1.5k lines of DEFLATE implementation in JS I plain copy-pasted from the interwebz because I cannot into JS modules.
Have you tried using Cursor with Claude embedded? I can't go back to anything else; it's very nice having the AI embedded in the IDE, and it just knows all the files I am working with. Cursor can use GPT-4o too if you want.
I too use Claude more frequently than OpenAI's GPT-4o. I think this is a twofold move for MS, and I like it. Claude being more accurate/efficient for me suggests they likely see the same thing; that's win number one. The second is that, with all the OpenAI drama, MS has started to distance themselves over a souring relationship (allegedly). If so, this could be a smart, tactful way to start moving away.
Either way, Claude is great so this is a net win for everyone.
I'm the same, but had a lot of issues getting structured output from Anthropic. Ended up always writing response processors. Frustrated by how fragile that was, decided to try OpenAI structured outputs and it just worked and since they also have prompt caching now, it worked out very well for my use case.
Anthropic seems to have addressed the issue using pydantic, but I haven't had a chance to test it yet.
I pretty much use Anthropic for everything else.
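For reference, the OpenAI structured-outputs path mentioned above looks roughly like this. It's a sketch: the Invoice schema is made up, and the exact method (client.beta.chat.completions.parse) and model name depend on your openai SDK version:

    # Sketch of OpenAI structured outputs with a pydantic model.
    # Invoice is a made-up schema; method/model names may differ by SDK version.
    from pydantic import BaseModel
    from openai import OpenAI

    class Invoice(BaseModel):
        vendor: str
        total_cents: int

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": "ACME Corp, total $12.50"}],
        response_format=Invoice,
    )
    invoice = completion.choices[0].message.parsed  # an Invoice instance, not raw text
    print(invoice.total_cents)

The appeal is that you get a validated object back instead of writing your own response processors for fragile JSON-ish text.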
>The second is with all the OpenAI drama MS has started to distance themselves over a souring relationship (allegedly). If so, this could be a smart move away tactfully.
I agree, this was a tactical move designed to give them leverage over OpenAI.
Yeah, Claude consistently impresses me.
A commenter on another thread mentioned it but it’s very similar to how search felt in the early 2000s. I ask it a question and get my answer.
Sometimes it’s a little (or a lot) wrong or outdated, but at least I get something to tinker with.
I recently tried to ask these tools for help with using a popular library, and both GPT-4o and Claude 3.5 Sonnet gave highly misleading and unusable suggestions. They consistently hallucinated APIs that didn't exist, and would repeat the same wrong answers, ignoring my previous instructions. I spent upwards of 30 minutes repeating "now I get this error" to try to coax them in the right direction, but always ending up in a loop that got me nowhere. Some of the errors were really basic too, like referencing a variable that was never declared, etc. Finally, Claude made a tangential suggestion that made me look into using a different approach, but it was still faster to look into the official documentation than to keep asking it questions. GPT-4o was noticeably worse, and I quickly abandoned it.
If this is the state of the art of coding LLMs, I really don't see why I should waste my time evaluating their confident sounding, but wrong, answers. It doesn't seem like much has improved in the past year or so, and at this point this seems like an inherent limitation of the architecture.
FWIW I almost never ask it to write code for me. I did once to write a matplotlib script and it gave me a similar headache.
I ask it questions mostly about libraries I’m using (usually that have poor documentation) and how to integrate it with other libraries.
I found out about Yjs by asking about different operational transform patterns.
Got some context on the prosemirror plugin by pasting the entire provider class into Claude and asking questions.
It wasn’t always exactly correct, but it was correct enough that it made the process of learning prosemirror, yjs, and how they interact pretty nice.
The “complete” examples it kept spitting out were totally wrong, but the information it gave me was not.
To be clear, I didn't ask it to write something complex. The prompt was "how do I do X with library Y?", with a bit more detail. The library is fairly popular and in a mainstream language.
I had a suspicion that what I was trying to do was simply not possible with that library, but since LLMs are incapable of saying "that's not possible" or "I don't know", they will rephrase your prompt and hallucinate whatever might plausibly make sense. They have no way to gauge whether what they're outputting is actually correct.
So I can imagine that you sometimes might get something useful from this, but if you want a specific answer about something, you will always have to double-check their work. In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code. I think there are coding assistant services that do this already, but frankly, I was expecting more from simple chat services.
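A minimal sketch of that feedback loop, with the LLM call stubbed out as a hypothetical ask_llm(); it just runs the candidate code in a subprocess and feeds any traceback back into the next prompt:

    # Sketch of a "generate, run, feed errors back" loop.
    # ask_llm() is a stand-in for whatever chat API you use.
    import subprocess
    import sys
    import tempfile

    def ask_llm(prompt):
        raise NotImplementedError("call your chat model here")

    def generate_and_check(task, max_rounds=3):
        prompt = task
        for _ in range(max_rounds):
            code = ask_llm(prompt)
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(code)
                path = f.name
            result = subprocess.run(
                [sys.executable, path], capture_output=True, text=True, timeout=30
            )
            if result.returncode == 0:
                return code  # it at least runs; correctness still needs human review
            # Otherwise hand the error back and try again.
            prompt = f"{task}\n\nYour last attempt failed with:\n{result.stderr}"
        return None

Note this only catches code that crashes outright, and running arbitrary generated code safely is its own problem, which is presumably why the chat services don't just do this by default.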
> if you want a specific answer about something
Specific is the specific thing that statistical models are not good at :(
> how do I do X with library Y?
Recent research and anecdotal experience has shown that LLMs perform quite poorly with short prompts. Attention just has more data to work with when there are more tokens. Try extending that question like “I am using this programming language and am trying to do this task with this library. How do I do this thing with this other library”
I realize prompt engineering like this is fuzzy and “magic,” but short prompts have a consistent lower performance.
> In the specific case of programming, this could be improved with a simple engineering task: integrate the output with a real programming environment, and evaluate the result of actually running the code.
Not as simple as you’d think. You’re letting something run arbitrary code.
Tho you should give aider.chat a try if you want to test out that workflow. I found it very very slow.
Well, it is a volume business. The <1% of advanced-skill developers will find AI helpers useless, but for the other 99% of IT CRUD peddlers these tools are quite sufficient. All in all, if employers can cut 15-20% of net development costs by reducing head count, it will be very worthwhile for companies.
I suspect it will go a different direction.
Codebases are exploding in size. Feature development has slowed down.
What might have been a carefully designed 100kloc codebase in 2018 is now a 500kloc ball of mud in 2024.
Companies need many more developers to complete a decent sized feature than they needed in 2018.
Agree. But we are already in that loop. A properly written 50 KLOC "monolith, hence outdated" app is now 30 microservices: 20 KLOC of surface code plus 100 KLOC submerged in convenience libraries, along with Kubernetes, Grafana, Datadog, a service mesh, and so on. From what I am seeing, companies are increasingly using off-the-shelf components, so KLOC will keep rising but developer count will not.
It's worse than that. Now the balls of mud are distributed. We get incredibly complex interactions between services which need a lot of infrastructure to enable them, that requires more observability, which requires more infrastructure...
Yeah. You can fit a lot of business logic into a 100kloc monolith written by skilled developers.
Once you start shifting it to micro services the business logic gets spread out and duplicated.
At the same time each micro-service now has its own code to handle rest, graphql, grpc endpoints.
And each downstream call needs error handling and retry logic.
And of course now you need distributed tracing.
And of course now your auth becomes much more complex.
And of course now each service might be called multiple times for the one request - better make them idempotent.
And each service will drift in terms of underlying libraries.
And so on.
Now we have been adding in LLM solutions so there is no consistency in any of the above services.
Each dev rather than look at the existing approaches instead asks Claude and it provides a slightly different way each time - often pulling in additional libraries we have to support.
These days I see so much bad code, like a single microservice with 3 different approaches to making an HTTP request.
Sure, but my specific question was fairly trivial, using a mainstream language and a popular library. Most of my work qualifies as CRUD peddling. And yet these tools are still wasting my time.
Maybe I'll have better luck next time, or maybe I need to improve my prompting skills, or use a different model, etc. I was just expecting more from state of the art LLMs in 2024.
Yeah there is a big disconnect between the devs caught up in the hype and the devs who aren't.
A lot of the devs in my office using Claude/gpt are convinced they are so much more productive but they aren't actually producing features or bug fixes any faster.
I think they are just excited about a novel new way to write code.
Conversely, I feel that the experience of searching has degraded a lot since 2016/17. My thesis is that, around that time, online spam increased by an order of magnitude.
Old style Google search is dead, folks just haven’t closed the casket yet. My index queries are down ~90%. In the future, we’ll look back at LLMs as a major turning point in how people retrieve and consume information.
I still prefer it over using llm. And I would be doubtful that llm search has major benefits over Google search imo
Depends what you want it for.
Right now, I find each tool better at different things.
If I can only describe what I want but don't know key words, LLM are the only solution.
If I need citations, LLMs suck.
Abstractive vs. extractive search.
I think it was the switch from desktop search traffic being dominant to mobile traffic being dominant, that switch happened around the end of 2016.
Google used to prioritise big comprehensive articles on subjects for desktop users but mobile users just wanted quick answers, so that's what google prioritised as they became the biggest users.
But also, per your point, I think those smaller, simpler, less comprehensive posts are easier to fake/spam than the larger, more comprehensive posts that came before.
Ironically, I almost never see quick answers in the top results; mostly it's dragged-out pages of paragraph after paragraph with ads in between.
Guess who sells the ads…
Winning the war against spam is an arms race. Spam hasn’t spent years targeting AI search yet.
It's getting ridiculous. Half of the time now when I ask AI to search some information for me, it finds and summarizes some very long article obviously written by AI, and lacking any useful information.
I don't think this is necessarily converse to what they said.
The speed with which AI models are improving blows my mind. Humans quickly normalize technological progress, but it's staggering to reflect on our progress over just these two years.
Yes! I'm much more inclined to write one-off scripts for short manual tasks as I can usually get AI to get something useful very fast. For example, last week I worked with Claude to write a script to get a sense of how many PRs my company had that included comprehensive testing. This was borderline best done as a manual task previously, now I just ask Claude to write a short bash script that uses the GitHub CLI to do it and I've got a repeatable reliable process for pulling this information.
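For the curious, a minimal sketch of that kind of throwaway script (in Python rather than bash, and assuming `gh` is installed and authenticated, that `gh pr list` exposes the `files` JSON field, and that "comprehensive testing" can be roughly approximated as "the PR touched test files") could look something like this:

    # Rough sketch, not the exact script: count merged PRs whose changes
    # include files that look like tests. Assumes `gh` is installed and
    # authenticated for the current repository.
    import json
    import subprocess

    result = subprocess.run(
        ["gh", "pr", "list", "--state", "merged", "--limit", "200",
         "--json", "number,title,files"],
        capture_output=True, text=True, check=True,
    )
    prs = json.loads(result.stdout)

    def looks_like_test(path: str) -> bool:
        # Crude heuristic; adjust to your repo's conventions.
        return "test" in path.lower()

    with_tests = [
        pr for pr in prs
        if any(looks_like_test(f["path"]) for f in pr.get("files", []))
    ]
    print(f"{len(with_tests)} of {len(prs)} merged PRs touch test files")

The point isn't that the heuristic is perfect; it's that getting something repeatable and "good enough" now takes minutes instead of being a manual chore.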
I rarely use LLMs for tasks but i love it for exploring spaces i would otherwise just ignore. Like writing some random bash script isn't difficult at all, but it's also so fiddly that i just don't care to do it. It's nice to just throw a bot at it and come back later. Loosely speaking.
Still i find very little use from LLMs in this front, but they do come in handy randomly.
Lots of progress, but I feel like we've been seeing diminishing returns. I can't help but feel like recent improvements are just refinements and not real advances. The interest in AI may drive investment and research in better models that are game-changers, but we aren't there yet.
You're proving GP's point about normalization of progress. It's been two years. We're still during the first iteration of applications of this new tech, advancements didn't have time yet to start compounding. This is barely getting started.
I don't know about you, but o1-preview/o1-mini has been able to solve many moderately challenging programming tasks that would've taken me 30 mins to an hour. No other models earlier could've done that.
It's an improvement but...I've asked it to do some really simple tasks and it'll occasionally do them in the most roundabout way you could imagine. Like, let's source a bash file that creates and reads a state file to do something for which the functionality was already built-in. Say I'm a little skeptical of this solution and plug it into a new o1-preview prompt to double check the solution, and it starts by critiquing the bash script and error handling instead of seeing that the functionality is baked in and it's plainly documented. Other errors have been more subtle.
When it works, it's pretty good, and sometimes great. But when failure modes look like the above I'm very wary of accepting its output.
> I've asked it to do some really simple tasks and it'll occasionally do them in the most roundabout way you could imagine.
But it still does the tasks you asked for, so that's the part that really matters.
I wonder how long people will still protest in these threads that "It doesn't know anything! It's just an autocomplete parrot!"
Because.. yea, it is. However.. it keeps expanding, it keeps getting more useful. Yea people and especially companies are using it for things which it has no business being involved in.. and despite that it keeps growing, it keeps progressing.
I do find the "stochastic parrot" comments slowly dwindle in number and volume with each significant release, though.
Still, i find it weirdly interesting to see a bunch of people be both right and "wrong" at the same time. They're completely right, and yet it's like they're also being proven wrong in the ways that matter.
Very weird space we're living in.
You're conflating three different things.
There's the question, "is an LLM just autocomplete"? The answer to that question is obviously no, but the question is also a strawman - people who actually use LLM's regularly do recognize that there is more to their capabilities than randomized pattern matching.
Separately, there's the question of "will LLM's become AGI and/or become super intelligent." Most people recognize that LLM's are not currently super intelligent, and that there currently isn't a clear path toward making them so. Still, many people seem to feel that we're on the verge of progress here, and feel very strongly that anyone who disagrees is an AI "doomer".
Then there's the question of "are we in an AI bubble"? This is more a matter of debate. Some would argue that if LLM reasoning capabilities plateau, people will stop investing in the technology. I actually don't agree with that view - I think there is a lot of economic value still yet to be realized in AI advancements - I don't think we're on the verge of some sort of AI winter, even if LLM's never become super intelligent.
> Most people recognize that LLMs are not currently super intelligent,
I think calling it intelligent is being extremely generous. Take a look at the following example which is a spelling and grammar checker that I wrote:
https://app.gitsense.com/?doc=f7419bfb27c89&temperature=0.50...
When the temperature is 0.5, both Claude 3.5 and GPT-4o can't properly recognize that GitHub is capitalized. You can see the responses by clicking in the sentence. Each model was asked to validate the sentence 5 times.
If the temperature is set to 0.0, most models will get it right (most of the time), but Claude 3.5 still can't see the sentence in front of it.
https://app.gitsense.com/?doc=f7419bfb27c89&temperature=0.00...
Right now, an LLM is an insanely useful and powerful next-word predictor, but I wouldn't call it intelligent.
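If anyone wants to run a similar experiment themselves, here is a rough sketch of the idea using the OpenAI Python client; the model name, prompt, and test sentence are placeholders, not the exact setup behind the links above:

    # Sketch only: ask a model to proofread the same sentence several times at
    # a given temperature and compare the answers. Assumes the `openai` package
    # and an OPENAI_API_KEY in the environment; model and prompt are placeholders.
    from openai import OpenAI

    client = OpenAI()
    sentence = "I pushed the code to GitHub yesterday."

    def check(temperature: float, runs: int = 5) -> list[str]:
        answers = []
        for _ in range(runs):
            resp = client.chat.completions.create(
                model="gpt-4o",  # placeholder model name
                temperature=temperature,
                messages=[
                    {"role": "system",
                     "content": "You are a spelling and grammar checker. "
                                "List every correction needed, or reply 'OK'."},
                    {"role": "user", "content": sentence},
                ],
            )
            answers.append(resp.choices[0].message.content)
        return answers

    for t in (0.5, 0.0):
        print(f"temperature={t}")
        for answer in check(t):
            print(" -", answer)

Running the same prompt several times per temperature is the whole trick: the inconsistency across runs is what the linked pages are showing.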
> I think calling it intelligent is being extremely generous ... can't properly recognize that GitHub is capitalized.
Wouldn't this make chimpanzees and ravens and dolphins unintelligent too? You're asking it to do a task that's (mostly) easy for humans. It's not a human though. It's an alien intelligence which "thinks" in our language, but not in the same way we do.
If they could, specialized AIs might think we're unintelligent based on how often we fail, even with advanced tools, at pattern-matching tasks that are trivial for them. Would you say they're right to feel that way?
Animals are capable of learning. LLMs cannot. An LLM uses weights that are defined during the training process to decide what to do next. An LLM cannot self-evaluate based on what it has said. You have to create a new message for it to create a new probability path.
Animals have the ability to learn and grow by themselves. LLMs are not intelligent and I don't see how they can be since they just follow the most likely path with randomness (temperature) sprinkled in.
The "statistical parrot" parrots have been demonstrably wrong for years (see e.g. LeCun et al[1]). It's just harder to ignore reality with hundreds of millions of people now using incredible new AI tools. We're approaching "don't believe your lying eyes" territory. Deniers will continue pretending that LLMs are just an NFT-level fad or bubble or whatever. The AI revolution will continue to pass them by. More's the pity.
[1] https://arxiv.org/abs/2110.09485
> Deniers will continue pretending that LLMs are just an NFT-level fad or bubble or whatever. The AI revolution will continue to pass them by. More's the pity.
You should re-read that very slowly and carefully and really think about it. Calling anyone that's skeptical a 'denier' is a red flag.
We have been through these AI cycles before. In every case, the tools were impressive for their time. Their limitations were always brushed aside and we would get a hype cycle. There was nothing wrong with the technology, but humans always like to try to extrapolate their capabilities and we usually get that wrong. When hype caught up to reality, investments dried up and nobody wanted to touch "AI" for a while.
Rinse, repeat.
LLMs are again impressive, for our time. When the dust settles, we'll get some useful tools but I'm pretty sure we will experience another – severe – AI winter.
If we had some optimistic but also realistic discussions on their limitations, I'd be less skeptical. As it is, we are talking about 'revolution', and developers being out of jobs, and superintelligence and whatnot. That's not the level the technology is at today, and it is not clear we are going to do anything other than get stuck in a local maximum.
A trillion dimensional stochastic parrot is still a stochastic parrot.
If these systems showed understanding we would notice.
No one is denying that this form of intelligence is useful.
I don't know how you can say they lack understanding of the world when in pretty much any standardised test designed to measure human intelligence they perform better than the average human. The only thing they don't understand is touch, because they're not trained on that, but they can already understand audio and video.
You said it, those tests are designed to measure human intelligence, because we know that there is a correspondence between test results and other, more general tasks - in humans. We do not know that such a correspondence exists with language models. I would actually argue that they demonstrably do not, since even an LLM that passes every IQ test you put in front of it can still trip up on trivial exceptions that wouldn't fool a child.
So they fail in their own way? They're not humans; that's to be expected.
An answer key would outperform the average human but it isn’t intelligent. Tests designed for humans are not appropriate to judge non humans.
No you don’t understand, if i put a billion billion trillion monkeys on typewriters, they’re actually now one super intelligent monkey because they’re useful now!
We just need more monkeys and it will be the same as a human brain.
What does the mass of users change about what it is? How many of them check the results for hallucinations, and how many don't because they simply trust the AI?
More than once these tools fail at tasks a fifth grader could understand
Are you confusing frequency of use with usefulness?
If these tools boost the productivity, where is the output spike of all the companies, the spike in revenue and profits?
How often do we lose the benefit of auto text generation to the loop of "That's wrong" / "Oh yes, of course, here is the correct version" / "Nope, still wrong" / prompt editing?
One service is not really enough -- you need a few to triangulate more often than not, especially when it comes to code using latest versions of public APIs
Phind is useful as you can switch between them -- but you only get a handful of o1 and Opus uses a day, which I burn through quickly at the moment on deeper things -- Phind-405b and 3.5 Sonnet are decent for general use.
Switch to Cursor with Claude backend and 5x immediately
I wonder what the rationale for this was internally. More OpenAI issues? competitiveness with Cursor? It seems good for the user to increase competition across LLM providers.
Also ambiguous title. I thought GitHub canceled deals they had in the work. The article is clearly about making a deal, but it's unclear from the article's title.
Cursor is kicking VS Code's butt because it has multiple models. Also, MS is hedging its bets against OpenAI; that relationship is not easy.
Or, anti trust fears.
Could be a fight against Llama, which excludes MS and Google in its open license (though I think has done separate pay deals with one or both of them). Meta are notably absent from this announcement.
Try fighting free and good-enough, haha. At least that's the plan of Meta, who does not benefit as much from selling this as from using it.
I usually feel like I can confidently express a change I want in code faster and better than I can explain what I want an AI to do in English. Like, if I have a good prompt, these tools work okay, but getting to that prompt is often almost as hard as just writing the code itself. Do others feel the same struggle?
I’m positive my experience pales in comparison to yours, as I don’t actually code anything beyond the occasional single use script, but YES! I hate trying to explain the exact SQL result I’m looking for or some text modification I need to be able to throw together a CTE since I have read-only access and can’t even build a temp table.
GPT-4o usually does well with SQL. It did a query right the first time using CTEs, array functions, a table created on the fly in the query, etc.
SQL syntax is fiddly so it’s nice to have a robot do it.
I don't know how people can claim such huge success using Copilot and such. I also own a subscription and tried to use it for coding, but it failed horribly at every task, from Spring Boot authentication configuration to AWS policies and Lambdas.
Writing the code myself using proper documentation was the only option.
I wonder if false information is written here in the comments section for certain reasons …
You need to spend time learning how to use it. This is difficult because there's no manual, and there's a widespread implication that it should just magically work well without you having to invest any effort in it.
If you can figure out HOW to invest that effort it becomes really valuable.
I wish I had good resources I could link you to here but I don't, which is a big part of the problem here.
You sound like you’ve had success using this tech for work. Can you tell more about your personal experience, please? I’ve tried ChatGPT a few times a year ago or so, but it was extremely frustrating, and I gave up.
Here's my "How I use LLMs and ChatGPT" series: https://simonwillison.net/series/using-llms/
Also relevant is my ai-assisted-programming tag: https://simonwillison.net/tags/ai-assisted-programming/
Maybe you should set the right expectations to find the right use.
There's plenty of copilot and cursor users out there, and developers are really not the kind of crowd that likes to pay for development tools.
I've been using Cody from Sourcegraph to have access to other models; if Copilot offers something similar I guess I will switch back to it. I find Copilot autocomplete to be more often on point than Cody, but the chat experience with Cody + Sonnet 3.5 is way ahead in my experience.
Context is a huge part of the chat experience in Cody, and we're working hard to stay ahead there as well with things like OpenCtx (https://openctx.org) and more code context based on the code graph (defs/refs/etc.). All this competition is good for everyone. :)
Your vscode and rider integrations are fantastic, love the different ways to add context to the chat
I'm a hypocrite because I'm now currently paying for Cody due to their integration with the new OpenAI o1-preview model. I find this model to be mind blowing and it's made me actually focus on the more mundane tasks that come with the job.
Anthropic’s article: https://www.anthropic.com/news/github-copilot
GitHub’s article: https://github.blog/news-insights/product-news/bringing-deve...
Google Cloud’s article: https://cloud.google.com/blog/products/ai-machine-learning/g...
Weird that it wasn’t published on the official Gemini news site here: https://blog.google/products/gemini/
Edit: GitHub Copilot is now also available in Xcode: https://github.blog/changelog/2024-10-29-github-copilot-code...
Discussion here: https://news.ycombinator.com/item?id=41987404
I wonder if behind the choice of calling the human user "mona" there's an Italian XD
https://i.imgur.com/z01xgfl.png
It's Mona Lisa the Octocat: https://github.com/monatheoctocat
Hah, TIL. https://cameronmcefee.com/work/the-octocat/
Google Cloud's article is from tomorrow?
https://cloud.google.com/blog/products/ai-machine-learning/g...
https://i.postimg.cc/RVWSfpvs/grafik.png
It’s October 30th in several parts of the world already. It’s after midnight everywhere GMT+7 onwards.
Obviously! However, Google being an American company, that was surprising. I'm in Europe and am used to seeing newest posts "from yesterday" when they are from the USA. This one is weird.
It says the 29th now
Can it help you work in large messy codebases or is it good only to "build tic tac toe games in 5 mins"?
I am excited about this as I use Claude for coding but what I really like about copilot is if you have a list of something random like:
/* Col1 varchar not null, Col2 int null, Col3 int not null */
Then start doing something else like:
| column | type |
| --- | --- |
| Col1 | varchar |
Then copilot is very good at guessing the rest of the table.
(This isn’t just sql to markdown it works whenever you want to repeat something using parts of another list somewhere in the same doc)
I hope they continue this, as it has been a game changer for me; it is so quick, really great.
So "cut a deal" means to make a deal rather than sever a deal?
https://www.youtube.com/@IsmoComedian/videos you'll like him
It does, and it was unclear to me, as well. It's a poor title IMO
There's a similar problem with the word "cleave" which can mean both cut apart or stick together.
Sensible.
A big part of competitors' (e.g. Aider, Cursor, I imagine also JetBrains) advantage was not being tied to one model as the landscape changed.
After large MS OpenAI investment they could just as easily have put blinders on and doubled down.
Jetbrains is doing its own LLM
Cursor is too! Mixing and matching specialized & flagship models is the way forward.
Github was an early OpenAI design partner. OpenAI developed a custom LLM for them.
It's so interesting that even after that early mover advantage they have to go back to the foundation model providers.
Does this mean that future tech companies have no choice but to do this?
It may not be a model quality issue. It may be that GitHub wants to sell a lot more of Copilot, including to companies who refuse to use anything from OpenAI. Now GitHub can say "Oh that's fine, we have these two other lovely providers to choose from."
Also, after Anthropic and Google sold massive amounts of pre-paid usage credits to companies, those companies want to draw down that usage and get their money's worth. GitHub might allow them to do that through Copilot, and therefore get their business.
I think that the credit scenario is more true for OpenAI than the others. Existing Azure commits can be used to buy OpenAI via the marketplace. It will never be as simple for any non-Azure partner (only GitHub is tying up with Anthropic here, not Azure).
GitHub doesn't even support using those Azure managed APIs for Copilot today; it is just a license you can buy currently and add to a user license. The best you can do is pay for Copilot with existing Azure commits.
This seems to be about not being left behind as other models outpace what Copilot can do with their custom OpenAI model, which doesn't seem to be getting updated.
Yes, because transfer learning works. A specialized model for X will be subsumed by a general model for X/Y/Z as it becomes better at Y/Z. This is why models which learn other languages become better at English.
Custom models still have use cases, e.g. situations requiring cheaper or faster inference. But ultimately The Bitter Lesson holds -- your specialized thing will always be overtaken by throwing more compute at a general thing. We'll be following around foundation models for the foreseeable future, with distilled offshoots bubbling up/dying along the way.
> This is why models which learn other languages become better at English.
Do you have a source for that, I'd love to learn more!
Evaluating cross-lingual transfer learning approaches in multilingual conversational agent models[1]
Cross-lingual transfer learning for multilingual voice agents[2]
Large Language Models Are Cross-Lingual Knowledge-Free Reasoners[3]
An Empirical Study of Cross-Lingual Transfer Learning in Programming Languages[4]
That should get you started on transfer learning re. languages, but you'll have more fun personally picking interesting papers over reading a random yahoo's choices. The fire hose of papers is nuts, so you'll never be left wanting.
[1] https://www.amazon.science/publications/evaluating-cross-lin...
[2] https://www.amazon.science/blog/cross-lingual-transfer-learn...
[3] https://arxiv.org/pdf/2406.16655v1
[4] https://arxiv.org/pdf/2310.16937v2
I see no reason why GitHub wouldn’t use fine tuned models from google or anthropic.
I think their version of gpt-3.5 was a fine tune as well. I doubt they had a whole model from scratch made just for them.
I still think it’s worth emphasising - LLMs represent a massive capital absorber. Taking gobs of funding into your company is how you grow, how your options become more valuable, how your employees stay with you. If that treadmill were to break bad things happen.
Search has been stuttering for a while - Google's growth and investment have been flattening - at some point they absorbed all the world's stored information.
OpenAI showed the new growth - we need billions of dollars to build and then run the LLMs (at a loss, one assumes) - the treadmill can keep going.
One of the reasons that comes to my mind is that it could have been a problematic look for only Microsoft (Copilot) to have access to GitHub for training AI models - à la monopolizing a data treasure trove. With anti-competitive legislation catching up to Google to open up its Play Store, this could have been one of the key reasons why this deal came about.
Copilot can choke on my AGPL code on GitHub, that was used for training their proprietary models. I'm still salty about this, sadly looks like the world has largely moved on.
Yet Google and Anthropic wanted in on the huge data that GitHub has to offer. It seems the world has not moved on just yet.
The Claude terms of service [1] apparently preclude Anthropic or AWS using GitHub user data for training:
GitHub Copilot uses Claude 3.5 Sonnet hosted on Amazon Web Services. When using Claude 3.5 Sonnet, prompts and metadata are sent to Amazon's Bedrock service, which makes the following data commitments: Amazon Bedrock doesn't store or log your prompts and completions. Amazon Bedrock doesn't use your prompts and completions to train any AWS models and doesn't distribute them to third parties.
[1] https://docs.github.com/en/copilot/using-github-copilot/usin...
It really feels like a digital form of colonialism; they come in take everything, completely disregard the rules, ignore intellectual copyright laws (while you still have to obey them), but when you speak out against this suddenly you are a luddite that doesn't care about human progress.
It's especially distasteful when we consider lawsuits like Epic vs Silicon Knights. https://en.wikipedia.org/wiki/Silicon_Knights
> Silicon Knights had "deliberately and repeatedly copied thousands of lines of Epic Games' copyrighted code, and then attempted to conceal its wrongdoing by removing Epic Games' copyright notices and by disguising Epic Games' copyrighted code as Silicon Knights' own
> Epic Games prevailed against Silicon Knights' lawsuit, and won its counter-suit for $4.45 million on grounds of copyright infringement,
> following the loss of the court case, Silicon Knights filed for bankruptcy
If it doesn't work, oh well, you'll get VC money for something else.
If it works, the lawyers will figure it out.
Interestingly GitHub (a Microsoft entity) will use Amazon Bedrock to run Claude Sonnet.
> Claude 3.5 Sonnet runs on GitHub Copilot via Amazon Bedrock, leveraging Bedrock’s cross-region inference to further enhance reliability.
[1] https://www.anthropic.com/news/github-copilot
Actually excited: a 2M context window will be useful in this case.
Great news! This can only mean better suggestions.
I expected little from Copilot, but now I find it indispensable. It is such a productivity multiplier.
I removed it from windows and I'm still very productive. Probably moreso, since I don't have to make constant corrections.
To each their own.
GitHub Copilot and Microsoft Copilot are different products
Their branding is confusing
Is Microsoft Copilot even a single product? It seems to me they just shove AI in random places throughout their products and call it Copilot. Which would make Github Copilot essentially another one of these places the branding shows up (even if it started there)
Same difference. They both are glorified liberians.
Liberians seem quite useful, then! I’ve never been to Africa myself.
Just don't lick your fingers.
Got to cut deals before the AI bubble pops, VC money and interest vanish, and interest rates go up.
Also diversifying is always a good option. Even if one cash cow gets nuked from orbit, you have 2 other companies to latch onto
> interest rates go up
This is kind of a cynical tech startup take:
- ragging on VCs
- calling something a bubble
Interest rates are on their way back down btw.
https://www.federalreserve.gov/newsevents/pressreleases/mone...
https://www.reuters.com/world/uk/bank-england-cut-bank-rate-...
Funding has looked to be running out a few times for OpenAI specifically, but most frontier model development is reasonably well funded still.
If interest rates are on their way down, why has the 10Y treasury yield increased 50 points over the last month? https://www.cnbc.com/quotes/US10Y
Because they previously decreased more under the expectation of another half point cut by the fed. Stronger economic indicators have cut the expectation for steep rate cuts so treasuries are declining.
It also dropped 40 points over the last six months.
>Interest rates are on their way back down btw.
completely wrong, where have you been the past month? 10Y t-notes are actually UP after the fed's hysterical 50 basis point cut lol
Call me eccentric, but the only true or utilitarian use case I've found for AI so far is ChatGPT. The rest all appear to be shiny toys just trying to bask in the AI glory, but none solve any real human problem?
ChatGPT is just good at marketing, and they were the first one with a good model. Claude is equally good or better.
I have no doubts that Claude is serviceable from a coder's perspective. But for me, as a paid user, I became tired of being told that I have to slow down and then be cut off while actively working on a product. When Anthropic addresses this, I'll add it back to my tools.
I mentored junior SWE and CS students for years, and now using Claude as a coding assistant feels very similar. Yesterday, it suggested implementing a JSON parser from scratch in C to avoid a dependency -- and, unsurprisingly, the code didn’t work. Two main differences stand out: 1) the LLM doesn’t learn from corrections (at least not directly), and 2) the feedback loop is seconds instead of days. This speed is so convenient that it makes hiring junior SWEs seem almost pointless, though I sometimes wonder where we’ll find mid-level and senior developers tomorrow if we stop hiring juniors today.
Does speed matter when it's not getting better and learning from corrections? I think I'd rather give someone a problem and have them come back with something that works in a couple days (answering a question here or there), rather than spend my time doing it myself because I'm getting fast, but wrong, results that aren't improving from the AI.
> though I sometimes wonder where we’ll find mid-level and senior developers tomorrow if we stop hiring juniors today.
This is also a key point. There is a lot of short-term thinking these days, since people don't stick with companies like they used to. As a person who has been with my company for close to 20 years, making sure things can still run once you leave is important from a business perspective.
Training isn't about today, it's about tomorrow. I've trained a lot of people, and doing it myself would always be faster in the moment. But it's about making the team better and making sure more people have more skill, to reduce single points of failure and ensure business continuity over the long-term. Not all of it pays off, but when it does, it pays off big.
Years of experience doesn't correlate to a good developer either. I've seen senior devs using AI to solve impossible problems, for example asking it how to store an API key client side without leaking it...
My anecdote is we have GitHub Copilot at work and it's often hit and miss. I often get out my phone and use ChatGPT or Claude to get better answers.
Seems to be part of Microsoft’s hedging of its OpenAI bet, ever since Sam Altman’s ousting: https://www.nytimes.com/2024/10/17/technology/microsoft-open...
Microsoft is negotiating equity in OpenAI as part of the switch to a for-profit. Non-zero chance this is a negotiation flex.
I guess this goes to show, nobody really has a moat in this game so far. Everyone is sprinting like crazy but I don't see anyone really gaining a sustainable edge that will push out competitors.
This kind of thing is why I think Sam is often misjudged. You can’t fuck around in such a competitive market. If you go in all kumbaya you’ll get crushed by market forces. It’s rare for company/founder ideals to survive the market indefinitely. I think he’s iterated fast and the job is still very hard.
This Github purchase was incredible
History has shown being first to market isn't all it's cracked up to be. You spend more, it's more difficult creating the trail others will follow, you end up with a tech stack that was built before tools and patterns stabilized, and you've created a giant super highway for a fast-follower. Anyone remember MapQuest, AltaVista or Hotmail?
OpenAI has some very serious competition now. When you combine that with the recent destabilizing saga they went through along with commoditization of models with services like OpenRouter.ai, I'm not sure their future is as bright as their recent valuation indicates.
Claude is better than OpenAI for most tasks, and yet OpenAI has enormously more users.
What is this, if not first mover advantage?
They seem to be going after different markets, or at least having differing degrees of success in going after different markets.
OpenAI is most successful with consumer chat app (ChatGPT) market.
Anthropic is most successful with business API market.
OpenAI currently has a lot more revenue than Anthropic, but it's mostly from ChatGPT. For API use the revenue numbers of both companies are roughly the same. API success seems more important than chat apps, since this will scale with the success of the user's business, and this is really where the dream of an explosion in AI profits comes from.
ChatGPT's user base size vs that of Claude's app may be first mover advantage, or just brand recognition. I use Claude (both web based and iOS app), but still couldn't tell you if the chat product even has a name distinct from the model. How's that for poor branding?! OpenAI have put a lot of effort into the "her" voice interface, while Anthropic's app improvements are more business orientated in terms of artifacts (which OpenAI have now copied) and now code execution.
Claude cannot “research” stuff on the web and provide results in 5 secs like 4o does for queries like “which is the cheapest Skoda car and how much?”
Just wanted to add a note to this. Tool calling - particularly to source external current data - is something that's had the big foundational LLM providers very nervous so they've held back on it, even though it's trivial to implement at this point. But we're seeing it rapidly emerge with third party providers who use the foundational APIs. Holding back tool calling has limited the complex graph-like execution flows that the big providers could have implemented on their user facing apps e.g. the kind of thing that Perplexity Pro has implemented. So they've fallen behind a bit. They may catch up. If they don't they risk becoming just an API provider.
I'm hoping a lot of the graph-like execution flow engines are still in stealth mode, as believe that's where we'll start to see truly useful AI.
Mass data parsing and reformatting is useful... but building agents that span existing APIs / tools is a lot more exciting to me.
I.e. IFTTT, with automatic tool discovery, parameter mapping, and output parsing handled via LLM
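For readers who haven't seen it, tool calling at the API level is fairly mundane; here's a heavily simplified sketch with the OpenAI Python client, where the `get_weather` tool and its schema are made up for the example (and it assumes the model actually decides to call the tool):

    # Minimal tool-calling sketch: the model picks a tool and arguments, our
    # code runs the tool (stubbed here) and feeds the result back for a final
    # answer. The `get_weather` tool is hypothetical.
    import json
    from openai import OpenAI

    client = OpenAI()
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    messages = [{"role": "user", "content": "Do I need an umbrella in Oslo today?"}]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    call = resp.choices[0].message.tool_calls[0]  # assumes the model chose the tool
    args = json.loads(call.function.arguments)

    messages.append(resp.choices[0].message)
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": json.dumps({"city": args["city"], "forecast": "rain"}),  # stubbed result
    })
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)

The "graph-like execution flow" part is everything around this loop: deciding which tools to expose, chaining calls, and validating outputs, which is exactly what the third-party orchestration layers are competing on.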
This is what I use phind for.
Most people's first exposure to LLMs was ChatGPT, and that was only what - like 18 months ago it really took off in the mainstream? We're still very early on in the grand scheme of things.
Yes it's silly to talk about first mover advantage in sub 3 years. Maybe in 2026 we can revisit this question and see if being the first mattered.
First mover being a general myth doesn't mean being the first to launch and then immediately dominating the wider market for a long period is impossible. It's just usually means their advantage was about a lot more than simply being first.
Claude is only better in some cherry picked standard eval benchmarks, which are becoming more useless every month due to the likelihood of these tests leaking into training data. If you look at the Chatbot Arena rankings where actual users blindly select the best answer from a random choice of models, the top 3 models are all from OpenAI. And the next best ones are from Google and X.
I'm subscribed to all of Claude, Gemini, and ChatGPT. Benchmarks aside, my go-to is always Claude. Subjectively speaking, it consistently gives better results than anything else out there. The only reason I keep the other subscriptions is to check in on them occasionally to see if they've improved.
I don't pay any attention to leaderboards. I pay for both Claude and ChatGPT and use them both daily for anything from Python coding to the most random questions I can think of. In my experience Claude is better (much better) that ChatGPT in almost all use cases. Where ChatGPT shines is the voice assistant - it still feels almost magical having a "human-like" conversation with the AI agent.
Anecdotally, I disagree. Since the release of the "new" 3.5 Sonnet, it has given me consistently better results than Copilot based on GPT-4o.
I've been using LLMs as my rubber duck when I get stuck debugging something and have exhausted my standard avenues. GPT-4o tends to give me very general advice that I have almost always already tried or considered, while Claude is happy to say "this snippet looks potentially incorrect; please verify XYZ" and it has gotten me back on track in maybe 4/5 cases.
Claude 3.5 Sonnet (New) is meaningfully better than ChatGPT GPT4o or o1.
my experience is that o1 is still slightly better for coding, sonnet new is better for analyzing data, and most other tasks besides coding
3.5 Sonnet, ime, is dramatically better at coding than 4o. o1-preview may be better, but it's too slow.
Bullshit. Claude 3.5 Sonnet owns the competition according to the most useful benchmark: operating a robot body in the real world. No other model comes close.
This seems incorrect. I don't need Claude 3.5 Sonnet to operate a robot body for me, and don't know anyone else who does. And general-purpose robotics is not going to be the most efficient way to have robots do many tasks ever, and certainly not in the short term.
Of course not but the task requires excellent image understanding, large context window, a mix of structured and unstructured output, high level and spatial reasoning, and a conversational layer on top.
I find it’s predictive of relative performance in other tasks I use LLMs for. Claude is the best. The only shortcoming is its peculiar verbosity.
Definitely superior to anything OpenAI has and miles beyond the “open weights” alternatives like Llama.
The problem is that it also fails on fairly simple logic puzzles that ChatGPT can do just fine.
For example, even the new 3.5 Sonnet can't solve this reliably:
> Doom Slayer needs to teleport from Phobos to Deimos. He has his pet bunny, his pet cacodemon, and a UAC scientist who tagged along. The Doom Slayer can only teleport with one of them at a time. But if he leaves the bunny and the cacodemon together alone, the bunny will eat the cacodemon. And if he leaves the cacodemon and the scientist alone, the cacodemon will eat the scientist. How should the Doom Slayer get himself and all his companions safely to Deimos?
In fact, not only its solution is wrong, but it can't figure out why it's wrong on its own if you ask it to self-check.
In contrast, GPT-4o always consistently gives the correct response.
Yeah, but Mistral brews a mean cup of tea, and Llama's easily the best at playing hopscotch.
Claude requires a login, ChatGPT does not.
I think "Claude" is also a bad name. If I knew nothing else, am I picking OpenAI or Claude based on the name? I'm going with OpenAI
Claude is a product name, OpenAI is a company name. You really think Claude is better than ChatGPT?
This brings up the broader question: why are AI companies so bad at naming their products?
All the OpenAI model names look like garbled nonsense to the layperson, while Anthropic is a bit of a mixed bag too. I'm not sure what image Claude is supposed to conjure, Sonnet is a nice name if it's packaged as a creative writing tool but less so for developers. Meta AI is at least to the point, though not particularly interesting as far as names go.
Gemini is kind of cool sounding, aiming for the associations of playful/curious of that zodiac sign. And the Gemini models are about as unreliable as astrology is for practical use, so I guess that name makes the most sense.
Asking Americans to read a French name that is a homonym for “clod” may not be the best mass market decision.
Plot twist: regular users don't care what model underneath is called or how it works.
The name ChatGPT is better than the name Claude, to me. Of course this is all subjective though.
Yes, muscle memory is powerful. But it's not an insurmountable barrier for a follower. The switch from Google to various AI apps like Perplexity being a case in point. I still find myself beginning to reach for Google and then 0.1 seconds later catching myself. As a side note: I'm also catching myself having a lack of imagination when it comes to what is solvable. e.g. I had a specific technical question about github's UX and how to get to a thing that no one would have written about and thus Google wouldn't know, but openAI chat nailed it first try.
Claude is more restricted and can't generate images.
I asked Claude a physics question about bullet trajectory and it refused to answer. Restricted too far imo.
couldn't you s/bullet/ball/ ? or s/bullet/arrow/ ?
You could, but you could also use a model that's not restricted so much that it cannot do simple tasks.
Exactly.
I ended up asking about a half-pound ball I would throw with a 3600 rpm spin, where the acceleration phase was 4 ms.
It had no issue with that but it was stupid.
Honestly I think the biggest reason for this is that Claude requires you to login via an email link whereas OpenAI will let you just login with any credentials.
This matters if you have a corporate machine and can't access your personal email to login.
It's a short lived first mover advantage.
Given that Hotmail is now Outlook.com, maybe that's a bad example.
Does this mean you need to be a paying user for Claude and Gemini or just with GitHub copilot?
In AI, the only real moat is seeing how many strategic partnerships you can announce before anyone figures out they’re all with the same people.
Claude and Carol and Carol and Carol?
Elseweb with GitHub Copilot today...
Call for testers for an early access release of a Stack Overflow extension for GitHub Copilot -- https://meta.stackoverflow.com/q/432029
If I'm already paying Anthropic can I use this without paying github as well?
Thank you, people, for contributing to this free software ecosystem. Oh, you can't monetize your work? Your problem, not ours! Deals are made, but for you, who provide your free code, we have zero monetization options on our GitHub platform. Go pay for Copilot, which was trained on your data.
I mean, this is the worst farce ever concocted. And people are oblivious what's happening...
We are not oblivious. We are powerless. Oracle could go toe to toe with Google and threaten multibillion fines over basically API and 11kLOC. As a open source developer, there is no way to match that.
Non-paywall alternative: GitHub Copilot will support models from Anthropic, Google, and OpenAI - https://www.theverge.com/2024/10/29/24282544/github-copilot-...
This only makes Copilot more competitive and price effective. Microsoft's business managers are smart.
So GitHub’s teaming up with Google, Anthropic, and OpenAI? Kinda feels Microsoft’s version of a ‘safety net’, but for who exactly? It’s hard not to wonder if this is actually about choice for the user or just insurance for Microsoft
I wonder if this is an example of the freedom of being an arms length subsidiary or foreshadowing to a broader strategy shift within Microsoft.
I replaced chatgpt with mybrain 1.0 and I'm seeing huge improvements in accuracy and reasoning performance!
Also energy efficiency significantly improved, no?
Frankly surprised to see GitHub (Microsoft) signing a deal with their biggest competitor, Google. It does give Microsoft some good terms/pricing leverage over OpenAI, though I'm not sure what degree Microsoft needs that given their investment in OpenAI.
GitHub Spark seems like the most interesting part of the announcement.
On the Anthropic blog it says it uses AWS Bedrock.
> Claude 3.5 Sonnet runs on GitHub Copilot via Amazon Bedrock, leveraging Bedrock’s cross-region inference to further enhance reliability.
https://www.anthropic.com/news/github-copilot
Isn’t using big models like gpt-4o going to slow down the autocomplete?
I think they mean for the chat and code editing features.
If you want to destroy open source completely, the more models the better. Microsoft's co-opting and infiltration of OSS projects will serve as a textbook example of eliminating competition in MBA programs.
And people still support it by uploading to GitHub.
Yes. Thank you for saying it. We're watching Microsoft et al. defeat open source.
Large language models are used to aggregate and interpolate intellectual property.
This is performed with no acknowledgement of authorship or lineage, with no attribution or citation.
In effect, the intellectual property used to train such models becomes anonymous common property.
The social rewards (e.g., credit, respect) that often motivate open source work are undermined.
Embrace, extend, extinguish.
> This is performed with no acknowledgement of authorship or lineage, with no attribution or citation.
GitHub hosts a lot of source code, including presumably the code it trained CoPilot on. So they satisfy any license that requires sharing the code and license, such as GPL 3. Not sure what the problem is.
Can you name a company with more OSS projects and contributors? Stop with the hyperbole...
Embrace, extend...
> The social rewards (e.g., credit, respect) that often motivate open source work are undermined.
You mean people making contributions to solve problems and scratch each others' itches got displaced by people seeking social status and/or a do-at-your-own-pace accreditation outside of formal structures, to show to prospective employees? And now that LLMs start letting people solve their own coding problems, sidestepping their whole social game, the credit seekers complain because large corps did something they couldn't possibly have done?
I mean sure, their contributions were a critical piece - in aggregate - individually, any single piece of OSS code contributes approximately 0 value to LLM training. But they're somehow entitled to the reward for a vastly greater value someone is providing, just because they retroactively feel they contributed.
Or, looking from a different angle: what the complainers are saying is, they're sad they can't extract rent now that their past work became valuable for reasons they had no part in, and if they could turn back time, they'd happily rent-seek the shit out of their code, to the point of destroying LLMs as a possibility, and denying the world the value LLMs provided?
I have little sympathy for that argument. We've been calling out "copyright laundering" way before GPT-3 was a thing - those who don't like to contribute without capturing all the value for themselves should've moved off GitHub years ago. It's not like GitHub has any hold over OSS other than plain inertia (and the egos in the community - social signalling games create a network effect).
> individually, any single piece of OSS code contributes approximately 0 value to LLM training. But they're somehow entitled to the reward for a vastly greater value someone is providing, just because they retroactively feel they contributed.
You are attributing arguments to people which they never made. The most lenient of open source licenses require a simple citation, which the "A.I." never provides. Your tone comes off as pretty condescending, in my opinion. My summary of what you wrote: "I know they violated your license, but too bad! You're not as important as you think!"
>Or, looking from a different angle: what the complainers are saying is, they're sad they can't extract rent now that their past work became valuable for reasons they had no part in, and if they could turn back time, they'd happily rent-seek the shit out of their code,
Wrong and completely unfair/bitter accusation. The only people rent seeking are the corporations.
What kind of world do you want to live in? The one with "social games" or the one with corporate games? The one with corporate games seems to have less and less room for artists, musicians, language graduates, programmers...
I deleted my github 2 weeks ago, as much about AI, as about them forcing 2FA. Before AI it was SAAS taking more than they were giving. I miss the 'helping each other' feel of these code share sites. I wonder where are we heading with all this. All competition and no collaboration, no wonder the planet is burning.
> And people still support it by uploading to GitHub.
It’s slowly, but noticeably moving from GitHub to other sites.
The network effect is hard to work against.
Migration is on my todo list, but it’s non trivial enough I’m not sure when I’ll ever have cycles to even figure out the best option. Gitlab? Self-hosted Git? Go back to SVN? A totally different platform?
Truth be told, Git is a major pain in the ass anyway and I’m very open to something else.
Mercurial was better than git IMO, at least for smaller projects.
A classic case of perfect being the enemy of the good. The answers are Gitlab and jj, cheers.
It doesn't matter whether it is uploaded to GitHub or not. They would siphon it from GitLab, self-hosting, or SourceForge as well, using crawlers.
> If you want to destroy open source completely
The irony is of course that open source is what they used to train their models with.
That was the point. They are laundering IP. It's the long way around the GPL, allowing them to steal.
Maybe there will be legal precedent set at some point around derived work in terms of the set of data used to train the AI? I'm not hopeful though.
How many OSS repositories do I personally have to read through for my own code to be considered stolen property?
That line of thought would get thrown out of court faster than an AI would generate it.
I assume you're not an AI model, but a real human being (I hope). The analogy "AI == human" just... doesn't work, really.
I think in this regard it works just fine. If the laws move to say that "learning from data" while not reproducing it is "stealing", then yes, you reading others code and learning from it is also stealing.
If I can't feed a news article into a classifier to teach it to learn whether or not that I would like that article that's not a world I want to live in. And yes it's exactly the same thing as what you are accusing LLMs of.
They should be subject to laws the same way humans are. If they substantially reproduce code they had access to then it's a copyright violation. Just like it would be for a human doing the same. But highly derived code is not "stolen" code, neither for AI nor for humans.
That’s beside the point.
Me teaching my brain someone’s way of syntactically expressing procedures is analogous to AI developers teaching their model that same mode of expression.
No, a program that copies files is quite different to a human that writes those files data down and recalls them.
Can I copy you or provide you as a service?
To me, the argument is a LLM learning from GPL stuff == creating a derivative of the GPL code, just "compressed" within the LLM. The LLM then goes on to create more derivatives, or it's being distributed (with the embedded GPL code).
Yes, I provide it as a service to my employer. It's called a job. Guess what? When I read code I learn from it and my brain doesn't care what license that code is under.
That’s what my employers keep asking.
It's not your reading that would be illegal, but your copying. This is well a documented area of the law and there are concrete answers to your questions.
Are you saying that if I see a nice programming pattern in someone else’s code, I am not allowed to use that pattern in my code?
This seems a bit nihilistic. You can't be automated. You can't process repos at scale.
Yet.
It'll be okay. We "destroyed" photography by uploading to places like Instagram and Facebook but photography as a whole is still alive. It turns out even though there is lots of stealing, the world still spins and people still seek out original creators.
I don't understand the case being made here at all. AI is violating FOSS licenses, I totally agree. But you can write more FOSS using AI. It's totally unfair, because these companies are not sharing their source, and extracting all of the value from FOSS as they can. Fine. But when it comes to OSI Open Source, all they usually had to do was include a text file somewhere mentioning that they used it in order to do the same thing, and when it comes to Free Software, they could just lie about stealing it and/or fly under the radar.
Free software needs more user-facing software, and it needs people other than coders to drive development (think UI people, subject matter specialists, etc.), and AI will help that. While I think what the AI companies are doing is tortious, and that they either should be stopped from doing it or the entire idea of software copyright should be re-examined, I also think that AI will be massively beneficial for Free Software.
I also suspect that this could result in a grand bargain in some court (which favors the billionaires of course) where the AI companies have to pay into a fund of some sort that will be used to pay for FOSS to be created and maintained.
Lastly, maybe Free Software developers should start zipping up all of the OSI licenses that only require that a license be included in the distribution and including that zipfile with their software written in collaboration with AI copilots. That and your latest GPL for the rest (and for your own code) puts you in as safe a place as you could possibly be legally. You'll still be hit by all of the "don't do evil"-style FOSS-esque licenses out there, but you'll at least be safer than all of the proprietary software being written with AI.
I don't know what textbook directs you to eliminate all of your competition by lowering your competition's costs, narrowing your moat of expertise, and not even owning a piece of that.
edit: that being said, I'm obviously talking about Free Software here, and not Open Source. Wasn't Open Source only protected by spirits anyway?
I wonder how this will affect latency.
Commoditize your complement, baby.
The threat of anti-trust creates a win for consumers, this is an example of why we need a strong FTC.
This is a standard “commoditize your complement” play. It’s in GitHub / Microsoft’s best interest to make sure none of the LLMs become dominant.
As long as that happens, their competitors light money on fire to build the model while GitHub continues to build / defend its monopoly position.
Also, given that there are already multiple companies building decent models, it’s a pretty safe bet Microsoft could build their own in a year or two if the market starts settling on one that’s a strategic threat.
See also: “embrace, extend, extinguish” from the 1990’s Microsoft antitrust days.
Anyone doing strategic business with Microsoft would do well to remember what they did to Nokia.
You mean waste a few billion on buying a company that couldn't compete with the market anymore because the iphone made "even an idiot should be able to use this thing, and it should be able to do pretty much everything" a baseline expectation with an OS/software experience to match? Nokia failed Nokia, and then Microsoft gave it a shot. And they also couldn't make it work.
(sure, that glosses over the whole Elop saga, but Microsoft didn't buy a Nokia-in-its-prime and killed it. They bought an already failing business and even throwing MS levels of resources at it couldn't turn it around)
Man, as a windows phone mourner the only disagreement i have with this comment is that they threw anywhere near MS level of resources at Nokia.
Satya never wanted the acquisition and nuked WP as soon as he could.
I can see why people would think that, but Microsoft did not buy Nokia.
They did buy the (then) richer half of the company. The other half is now trying to get out of the rot.
> “The size of the Lego blocks that Copilot on AI can generate has grown [...] It certainly cannot write a whole GitHub or a whole Facebook, but the size of the building blocks will increase”
Um, that would make it less capable, not more... /thatguy
Can’t say I’m surprised
Yet another confirmation that AI models are nothing but commodities.
There's no moat, none.
I'm really curious how any company building models can hope to have any meaningful return on their billion-dollar investments, when a few people leaving and getting enough Azure credits can create a competitor in a few months.
Seems to be trying to get its lunch money back from CodeGPT plugin and similar ones
Wake me up when they support self hosted llama or openwebui.
Wonder if we'll ever see a standard LLM API.
> Wonder if we'll ever see a standard LLM API.
At this point its just the OpenAI API
Is there no open source alternative? Like a plugin or something.
for VScode you can use https://github.com/twinnydotdev/twinny
Not for visual studio 2022 unfortunately.
There are several plugins for VS 2022 that offer Copilot-like UI on top of a local Llama model, although I can't speak for their quality.
Hmm, I wonder why I didn't seem to find any.
https://www.continue.dev/ for autocomplete/chat
https://github.com/cline/cline for the agent thing
cursor.ai lets you use any OpenAI-compatible endpoint, although not all features work. And continue.dev does too, iirc.
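The "OpenAI-compatible endpoint" trick is also easy to try on its own; here's a minimal sketch that points the standard OpenAI client at a self-hosted server (assuming a local Ollama instance on its default port, with a model already pulled):

    # Sketch: reuse the standard OpenAI client against a local,
    # OpenAI-compatible server. Assumes Ollama is running locally with the
    # `llama3` model pulled; the model name is whatever you have available.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
        api_key="unused-but-required",         # local servers typically ignore this
    )

    resp = client.chat.completions.create(
        model="llama3",
        messages=[{"role": "user", "content": "Write a haiku about version control."}],
    )
    print(resp.choices[0].message.content)

Editor plugins that accept a custom base URL are doing essentially this under the hood, which is why swapping in a self-hosted model is mostly a configuration change.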
I wish people would stop posting Bloomberg paywall links.
"cuts" has to be the worse word choice in this context, it sounds like they're terminating deals rather than creating them.
Common english lexicon should cut ties with the phrase "cut a deal"
Yeah I agree, could be confusing to non native speakers though. It's a weird idiom.
It's up there with "drops" for word swith multiple different meanings
Came here to say that - my reaction was initially "I didn't know they even had those deals to cut them!"
"inks"
is there a slim chance at a title change?
or a fat chance?
Can we change the title to “GitHub _signs_ deals with Google, Anthropic” ?
The original got me thinking it already had deals it was getting out of
To "cut a deal" is a common (American?) English idiom meaning to "make a deal".
But agree that it's better to avoid using idioms on a site that has many visitors for whom English is not their first language.
Do you mean that Bloomberg should have used a different title or Hacker News should have modified the title?
I think Bloomberg’s at fault: “cut a deal” isn’t usually that ambiguous because it’s clear which state transition is more likely. But here it’s plausible they could’ve been ending some existing training-data-sharing agreement, or that they were making a new different deal. Also the fact it’s pluralised here makes it different enough to the most common form for it to be a bit harder to notice the idiom. But since we can’t change the fact they used that title, I would like HN to change it now.
I agree, very weird choice of words.
That’s a strange usage of the word “cuts”. I thought GitHub had terminated its deals with Google and Anthropic. It would be better if the title said “GitHub signs AI deals” instead of “cuts”.
https://plainenglish.com/expressions/cut-a-deal/#:~:text=Tod....
I'm assuming you're not a native speaker? (I'm not.) "To cut a deal" is a fairly common idiom that means to reach an agreement.
That’s correct, not a native speaker. I am not well versed in slang, and I am sometimes embarrassed because I speak as if I were reciting words from a book instead of sounding like spoken language. Do you know how “cuts” came to mean making a deal? For a non-native speaker it suggests the exact opposite, as in “he cut a wire”. Language evolves in strange ways.
"Cut a deal" is an idiom, not slang: it's appropriate language to use in a business context, for example.
The origin is hazy; of the theories I've seen, I consider this the best one: "deal" means both "an agreement" and "to distribute cards in a card game". The dealer, in the latter sense, first cuts the deck and then deals the cards. "Cut and deal" -> "cut a deal".
It could also be related to "cut a check", which comes from an era before perforated paper was widespread, when one would literally cut the check out of a book of checks.
Thanks much for the explanation.
As an aside, "closing" and "concluding" a deal or sale also usually mean to successfully reach an agreement. It's more of a semantic quirk around deals than an isolated idiom.
"cut a deal" is an idiom.
"cuts ____ deals" is not and more closely resembles the removal of deals related to ____.
Award for most ambiguous headline. (“cuts” can mean “initiates” or “terminates”!)
I also thought that they were ending some previous deal and not creating a new one.
“Cut a deal” is a standard phrase with only one meaning
But when you make it "cut AI deal", that breaks the standard phrase and opens the door to alternative explanations. I initially thought this was a news article about the deal breaking up.
I thought the same indeed.
Yes, but "cuts deals", which is what the title says, is ambiguous.
Those are some load bearing pluralisations right there.
A word can have multiple meanings.
Same with drops
The reason here is Microsoft is trying to make copilot a platform. This is the essential step to moving all the power from OpenAI to Microsoft. It would grant Microsoft leverage over all providers since the customers would depend on Microsoft and not OpenAI or Google or Anthropic. Classic platform business evolution at play here.
I'm sure there are multiple reasons, including lowering the odds of antitrust action by regulators. The EU was already sniffing around Microsoft's relationship with OpenAI.
I think the reason here is that Copilot is very very obviously inferior to Cursor, mostly because the model at its core is pretty dumb.
The Copilot team probably thinks of Cursor's efforts as cute. They can be a neat little product in their tiny corner of the market.
It's far more valuable to be a platform. Maybe Cursor can become a platform, but the race is on and they're up against giants that are moving rather surprisingly nimbly.
GitHub does way more: you can build on top of it, and they already have a metric ton of business relationships and enterprise customers.
A developer will spend far more time in the IDE than in the version control system, so I wouldn't discount it that easily. That being said, there are no network effects for an IDE, and Cursor is basically just a VSCode plugin. Maybe Cursor gets a nice acquihire deal.
A case where "cut" is its own antonym, and its unclear which sense is meant from the headline alone.
Cutting Deals and Striking Bargains: The History of an Idiom
https://web.archive.org/web/20060920230602/https://www.csub....
By way of "Why do we 'cut' a deal?" https://english.stackexchange.com/q/284233
---
"Cuts " ... leads to the initial parsing of "cuts all ties with" or similar "severs relationship with".
When with additional modifiers between "cuts" and "deal" the "cuts deal with" becomes harder to recognize as the "forms a deal with" meaning of the phase.
I just had the same problem and thought an existing deal had been ended.
Yeah, I was expecting outrage when I first clicked into the thread to glance at the comments, and then I was like "wait, why are people saying it's exciting?"
Don’t think I’ve ever seen the word “cut” used with “deal” in a negative sense. Cutting a deal always means you made a deal, not that one ended.
What about "we were cut from the deal"? It seems like you could make a phrase in which 'cut' means "to exclude"
Doesn’t sound natural to me, and I couldn’t find any examples online using that phrasing to mean someone was removed from a deal. You can be cut from a team, though.
GitHub sublates AI deals with Google, Anthropic
I don’t like using AI assistants in my editor; I prefer to keep it as clean as possible. So, I manually copy relevant parts of the code into ChatGPT, ask my question, and continue interacting until I get what I need. It’s a bit manual, but since I use GPT for other tasks, it’s convenient to have a single interface for everything.
Paywall; can't read legally.
You can read legally, just need to pay first.
You mean "Microsoft" cuts deals with Google and Anthropic on top of their already existing deals with Mistral, Inflection whilst also having an exclusivity deal with OpenAI?
This is extend-to-extinguish, round 4 [0], whilst racing everyone else to zero.
[0] https://news.ycombinator.com/item?id=41908456
Whatever the motive, it's probably a smart move
I replaced ChatGPT Plus with hosted nvidia/Llama-3.1-Nemotron-70B-Instruct for coding tasks. Nemotron produces good code, and the cost difference is massive: Nemotron is available for $0.35 per Mtoken in and out, while ChatGPT is considerably more expensive.
Just kidding. Qwen 2.5 Instruct is superior. Nemotron is overfit to pass benchmarks.