I see a lot of snark in the comments. Simon is a researcher and I really like seeing his experiments! Sounds like the goal here was to delegate a discrete task to an LLM and have it solve the problem much like one would task a junior dev with the same.
And like a junior dev it ran into some problems and needed some nudges. Also like a junior dev it consumed energy resources while doing it.
In the end I like that the chunk size of work that we can delegate to LLMs is getting larger.
No offense, but I hate all the comparisons to a "junior dev" that I see out there. This process is just like any dev! I mean, who wouldn't have to tinker around a bit to get some piece of software to work? Is there a human out there who would just magically type all the right things - no errors - first try?
> And like a junior dev it ran into some problems and needed some nudges.
There are people who don't get blocked waiting for external input in order to get tasks like this done, which I think is the intended comparison. There's a level of intuition that junior devs and LLMs don't have that senior devs do.
To offer a counterpoint, I had much better intuition as a junior than I do now, and it was also better than the seniors on my team.
Sometimes looking at the same type of code and the same infra day in and day out makes you rusty. In my olden days, I did something different every week, and I had more free time to experiment.
So you are a worse dev now than you were before? Have you asked for a pay cut from your employer?
pay increase - with better tools, I'd imagine
Codex is actually pretty good at getting things working and unblocking itself.
It’s just that when I review the code, I would do things differently because the agent doesn’t have experience with our codebase. Although it is getting better at in-context learning from the existing code, it is still seeing all of it for the “first time”.
It’s not a junior dev, it’s just a dev perpetually in their first week at a new job. A pretty skilled one, at that!
And a lot of things translate. How well do you onboard new engineers? Well-written code is easier to read and modify, tests help maintain correctness while showing examples, etc.
Point taken and I should have known better. I fully agree with you. I suppose I should say inexperienced dev or something more accurate. Having worked with many inexperienced devs there was quite a spread in capabilities. Using terms that are dismissive to individuals is not helpful.
I did the opposite yesterday. I used GPT-5 to brute-force dotnet into Claude Code for Web, which eventually involved it writing an entire HTTP proxy in Python to download NuGet packages.
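For the curious, the core of that kind of pass-through proxy is pretty small. A minimal sketch, not the actual code the model wrote, with an illustrative upstream feed URL and port:

    # Minimal pass-through HTTP proxy for a package feed.
    # The upstream URL and port are illustrative, not the poster's setup.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.request import urlopen

    UPSTREAM = "https://api.nuget.org"  # assumed upstream feed

    class ProxyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Forward the request path upstream and relay the response body.
            # Errors (e.g. upstream 404s) are unhandled in this sketch.
            with urlopen(UPSTREAM + self.path) as resp:
                body = resp.read()
                self.send_response(resp.status)
                self.send_header("Content-Type",
                                 resp.headers.get("Content-Type", "application/octet-stream"))
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), ProxyHandler).serve_forever()

Point the package manager's feed URL at 127.0.0.1:8080 and everything fetches through whatever network path the proxy itself has.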
Compute well spent... figuring out that it needed to download a version- and hardware-appropriate wheel.
Don't ask how many human compute hours are spent figuring this out.
Gotta keep the hype up!
No idea why Nvidia has such crusty torch prebuilds on their own hardware. Just finished installing unsloth on a Thor box for some fine-tuning; it's a lengthy build marathon, thankfully aided by Grok giving commands/environment variables for the most part (one finishing touch is to install the latest CUDA from the Nvidia website and then replace the compiler executables in the triton package with newer ones from CUDA).
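If anyone else goes down this road, a quick sanity check at the end of the build marathon is worth it. A minimal sketch, assuming torch and triton both ended up importable:

    # Verify the freshly built wheels actually see the GPU.
    import torch
    import triton

    print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
    print("triton:", triton.__version__)
    print("GPU visible:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))

If torch.cuda.is_available() comes back False after all that, the usual suspect is a mismatch between the wheel's CUDA build and the driver.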
Serious q: why Grok vs another frontier model?
Grok browses a large number of websites for queries that need recent information, which is super handy for new hardware like Thor.
Am I the only one seeing this Nvidia Spark as meh?
I had it in my cart, but then watched a few videos from influencers, and it looks like the power of this thing doesn't match the hype.
For inference you might as well get a Strix Halo for half the price.
It's also going to be unsupported after a few years.
Ehh, is it cool and a time saver that it figured it out? Yes. But the solution was to get a "better" prebuilt PyTorch wheel. That's a relatively "easy" problem to solve (though figuring out that this was the problem does take time). And it's (probably; I can't afford one) going to be painful when you want to upgrade the CUDA version or pin a specific one. Unlike on a typical PC, you're going to need to build a new image and flash it. I'd be more impressed when an LLM can do this end to end for you.
PyTorch + CUDA is a headache I've seen a lot of people have at my uni, and one I've never had to deal with thanks to uv. Good tooling really does go a long way in these things.
Although I must say that for certain Docker passthrough cases, the debugging logs just aren't as detailed.
uv doesn’t fundamentally solve the issues. It didn’t invent venv or pip.
What fundamentally solves the issue is to use an ONNX version of the model.
Do you know if it's possible to run ONNX versions of models on a Mac?
I should try those on the NVIDIA Spark, be interesting to see if they are easy to work with on ARM64.
Yup. The beauty of it is that the underlying AI accelerator/hardware is completely abstracted away. There's a CoreML ONNX execution provider, though I haven't used it.
No more fighting with hardcoded cuda:0 everywhere.
The only pain point is that you'll often have to manually convert a PyTorch model from Hugging Face to ONNX unless it's very popular.
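To make the provider story concrete, here's a minimal sketch of running an ONNX model with onnxruntime; the model path and input shape are hypothetical, and the provider list just expresses a preference order:

    import numpy as np
    import onnxruntime as ort

    # Prefer an accelerator if this onnxruntime build has one; fall back to CPU.
    preferred = ["CoreMLExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
    providers = [p for p in preferred if p in ort.get_available_providers()]

    session = ort.InferenceSession("model.onnx", providers=providers)

    # No hardcoded cuda:0 anywhere: the same script runs on a Mac,
    # a CUDA box, or plain CPU.
    name = session.get_inputs()[0].name
    outputs = session.run(None, {name: np.zeros((1, 3, 224, 224), dtype=np.float32)})

For the conversion pain point, Hugging Face's optimum exporter handles a lot of the popular architectures, though it won't cover everything.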
You can still upgrade CUDA within forward compatibility range and install new packages without reflashing.