290 comments

  • _pdp_ 6 hours ago

    LLMs need prompts. Prompts can get very big very quickly. The so-called "skills", which exist in other forms on other platforms outside of Anthropic and OpenAI, are simply a mechanism to extend the prompt dynamically. The tools (scripts) that are part of a skill are no different from simply having the tools already installed in the OS where the agent operates.

    The idea behind skills is sound because context management matters.

    However, skills are different from MCP. Skills have nothing to do with tool calling at all!

    You can implement your own version of skills easily and there is absolutely zero need for any kind of standard or framework. The way to do it is to register a tool / function to load and extend the base prompt and presto - you have implemented your own version of skills.
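
    A minimal sketch of that pattern (all names here are illustrative, not any particular framework's API): register one tool whose only job is to append a skill file to the working prompt.

      # Minimal sketch: "skills" as a single prompt-extending tool.
      # All names are illustrative assumptions, not a real framework API.
      from pathlib import Path

      SKILLS_DIR = Path("skills")          # one markdown file per skill
      system_prompt = "You are a helpful agent."

      def list_skills() -> str:
          """One-line index the model sees up front."""
          return "\n".join(f"- {p.stem}" for p in SKILLS_DIR.glob("*.md"))

      def load_skill(name: str) -> str:
          """Tool the model calls; its output extends the working prompt."""
          global system_prompt
          body = (SKILLS_DIR / f"{name}.md").read_text()
          system_prompt += "\n\n" + body   # temporary or permanent, your call
          return f"Loaded skill '{name}'."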

    In ChatBotKit AI Widget we even have our own version of that for both the server and when building client-side applications.

    With client-side applications the whole thing is implemented with a simple react hook that adds the necessary tools to extend the prompt dynamically. You can easily come up with your own implementation of that with 20-30 lines of code. It is not complicated.

    Very often people latch onto some idea thinking this is the next big thing, hoping that it will explode. It is not new and it won't explode! It is just part of a suite of tools that already exist in various forms. The mechanic is so simple at its core that it practically makes no sense to call it a standard, and there is absolutely zero need to have it for most types of applications. It does make sense for coding assistants though, as they work with quite a bit of data, so there it matters. But skills are not fundamentally different from *.instruction.md prompts in Copilot or AGENT.md and its variations.

    • electric_muse 3 hours ago

      > But skills are not fundamentally different from *.instruction.md prompt in Copilot or AGENT.md and its variations.

      One of the best patterns I’ve seen is having an /ai-notes folder with files like ‘adding-integration-tests.md’ that contain specialized knowledge suitable for specific tasks. These “skills” can then be inserted/linked into prompts where I think they are relevant.

      But these skills can’t be static. For best results, I observe what knowledge would make the AI better at the skill the next time. Sometimes I ask the AI to propose new learnings to add to the relevant skill files, and I adopt the sensical ones while managing length carefully.

      Skills are a great concept for specialized knowledge, but they really aren’t a groundbreaking idea. It’s just context engineering.

      • CuriouslyC an hour ago

        Pro tip, just add links in code comments/readmes with relevant "skills" for the code in question. It works for both humans and agents.

        • _pdp_ 39 minutes ago

          This is exactly what I do. It works super well. Who would have thought that documenting your code helps both other developers and AI agents? I'm being sarcastic.

      • tedivm 2 hours ago

        Back in my day we referred to this as "documentation". It turns out it's actually useful for developers too, not just agents.

        • abirch 21 minutes ago

          Wait developers RTFM?

      • pbronez an hour ago

        I’ve seen some dev agents do this pretty well.

    • cube2222 2 hours ago

      The general idea is not very new, but the current chat apps have added features that are big enablers.

      That is, skills make the most sense when paired with a Python script or cli that the skill uses. Nowadays most of the AI model providers have code execution environments that the models can use.

      Previously, you could only use such skills with locally running agent clis.

      This is imo the big enabler, which may totally mean that “skills will go big”. And yeah, having implemented multiple MCP servers, I think skills are a way better approach for most use-cases.

      • jmalicki 35 minutes ago

        MCP servers are really just skills paired with python scripts, it's not really that different, MCP just lets you package them together for distribution.

        • cube2222 9 minutes ago

          But then those work only locally - not in the web UIs, unless you make it a remote MCP, and then it’s back to being something somewhat different.

          Skills also have a nicer way of working with the context, by default (and in the main web UIs), with their overview-driven lazy loading.

      • DonHopkins 40 minutes ago

        I like the focus on Python CLI tools, using the standard argparse module, and writing good help and self-documentation.

        You can develop skills incrementally, starting with just one md file describing how to do something, and no code at first.

        As you run through it for the first several times, testing and debugging it, you accumulate a rich history of prompts, examples, commands, errors, recovery, backing up and branching. But that chat history is ephemeral, so you need to scoop it up and fold it back into the md instructions.

        While the experience is still fresh in the chat, have it uplift knowledge from the experience into the md instructions, refine the instructions with more details, give concrete examples of input and output, add more detailed and explicit instructions, handle exceptions and prerequisites, etc.

        Then after you have a robust reliable set of instructions and examples for solving a problem (with branches and conditionals and loops to handle different conditions, like installing prerequisite tools, or checking and handling different cases), you can have it rewrite the parts that don't require "thought" into python, as a self documenting cli tool that an llm, you, and other scripts can call.

        It's great to end up with a tangible well documented cli tool that you can use yourself interactively, and build on top of with other scripts.

        Often the whole procedure can be rewritten in python, in which case the md instructions only need to tell how to use the python cli tool you've generated, which cli.py --help will fully document.
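
        As a sketch, the skeleton of such a self-documenting CLI is just the standard library's argparse (the file name and the example task here are hypothetical):

          #!/usr/bin/env python3
          """cli.py - hypothetical skill helper; --help doubles as its documentation."""
          import argparse

          def main():
              parser = argparse.ArgumentParser(
                  description="Convert a raw exported report into a cleaned CSV (example task).")
              parser.add_argument("input", help="path to the raw report file")
              parser.add_argument("-o", "--output", default="out.csv",
                                  help="where to write the cleaned CSV (default: out.csv)")
              parser.add_argument("--dry-run", action="store_true",
                                  help="show what would be done without writing anything")
              args = parser.parse_args()
              if args.dry_run:
                  print(f"Would convert {args.input} -> {args.output}")
                  return
              # ... the deterministic conversion logic goes here ...
              print(f"Converted {args.input} -> {args.output}")

          if __name__ == "__main__":
              main()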

        But if it requires a mix of llm decision making or processing plus easily automated deterministic procedures, then the art is in breaking it up into one or more cli tools and file formats, and having the llm orchestrate them.

        Finally you can take it all the way into one tool, turn it outside in, and have the python cli tool call out to an llm, instead of being called by an llm, so it can run independently outside of cursor or whatever.

        It's a lot like a "just in time" compiler from md instructions to python code.

        Anyone can write up (and refine) this "Self Optimizing Skills" approach in another md file of meta instructions for incrementally bootstrapping md instructions into python clis.

    • btown 2 hours ago

      > [The] way to do is to register a tool / function to load and extend the base prompt and presto - you have implemented your own version of skills.

      So are they basically just function tool calls whose return value is a constant string? Do we know if that’s how they’re implemented, or is the string inserted into the new input context as something other than a function_call_output?

      • _pdp_ 38 minutes ago

        No. You basically call a function to temporarily or permanently extend the base prompt. But of course you can think of other patterns to do more interesting things depending on your use-case. The prompt selection is basically RAG.

    • bg24 4 hours ago

      With a little bit of experience, I realized that it makes sense even for an agent to run commands/scripts for deterministic tasks. For example, to find a particular app out of a list of N (can be 100) with complex filtering criteria, the best option is to run a shell command to get specific output.
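
      For instance (file name and fields below are purely illustrative), the deterministic filtering step could be a tiny throwaway script whose short output is all the model ever reads:

        # Illustrative deterministic step: filter a large app list down to the
        # few lines the model actually needs to reason about.
        import json

        with open("apps.json") as f:
            apps = json.load(f)

        matches = [a["name"] for a in apps
                   if a.get("region") == "eu" and a.get("replicas", 0) > 3]
        print("\n".join(sorted(matches)))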

      This way, you can divide a job to be done into blocks of reasoning and deterministic tasks. The latter are scripts/commands. The whole package is called a skill.

    • lxgr 6 hours ago

      Many useful inventions seem blindingly obvious in hindsight.

      Yes, in the end skills are just another way to manage prompts and avoid cluttering the context of a model, but they happen to be one that works really well.

      • _pdp_ 5 hours ago

        It is not an invention if it is common sense and there is plenty of prior art. How would you otherwise dynamically extend the prompt? You will have some kind of function that, based on the selected preferences, adds more prompt text to the base prompt. That is basically what this is, except that Anthropic added it as a built-in tool.

        • skydhash 5 hours ago

          Open the python REPL

          Type `import math`

          You now have more skills (symbols)

    • valstu 5 hours ago

      > However, skills are different from MCP. Skills have nothing to do with tool calling at all

      Although skills require that you have certain tools available, like basic file system operations, so the model can read the skill files. Usually this is implemented as an ephemeral "sandbox environment" where the LLM has access to the file system and can also execute Python, run bash commands, etc.

    • kelvinjps10 3 hours ago

      Isn't it the simplicity of the concept that will make it "explode"?

  • extr 13 hours ago

    It’s crazy how Anthropic keeps coming up with sticky “so simple it seems obvious” product innovations and OpenAI plays catch up. MCP is barely a protocol. Skills are just md files. But they seem to have a knack for framing things in a way that just makes sense.

    • Jimmc414 7 hours ago

      Skills are lazy loaded prompt engineering. They are simple, but powerful. Claude sees a one line index entry per skill. You can create hundreds. The full instructions only load when invoked.

      Those instructions can reference external scripts that Claude executes without loading the source. You can package them with hooks and agents in plugins. You pay tokens for the output, not the code that calls it.

      Install five MCPs and you've burned a large chunk of tokens before typing a prompt. With skills, you only pay for what you use.

      You can call deterministic code (pipelines, APIs, domain logic) with a non-deterministic model, triggered by plain language, without the context bloat.

    • altmanaltman 9 hours ago

      > https://www.anthropic.com/news/donating-the-model-context-pr...

      This is a prime example of what you're saying. Creating a "foundation" for a protocol created a year ago that's not even a protocol

      Has the Gavin Belson tecthics energy

      • sigmoid10 5 hours ago

        Anthropic is in a bit of a rough spot if you look at the raw data points we have available. Their valuation is in the same order of magnitude as OpenAI, but they have orders of magnitude fewer users. And current leaderboards for famous unsolved benchmarks like ARC AGI and HLE are also dominated by Google and OpenAI. Announcements like the one you linked are the only way for Anthropic to stay in the news cycle and justify its valuation to investors. Their IPO rumours are yet another example of this. But I really wonder how long that strategy can keep working.

        • ramraj07 5 hours ago

          Those benchmarks mean nothing. Anthropic still makes the models that get real work done in the enterprise. We want to move but are unable to.

          If anyone disagrees, I would like to see their long-running deep research agents built on Gemini or OpenAI.

          • sigmoid10 5 hours ago

            I have built several agents based on OpenAI now that are running real life business tasks. OpenAI's tool calling integration still beats everyone else (in fact it did from the very beginning), which is what actually matters in real world business applications. And even if some small group of people prefer Anthropic for very specific tasks, the numbers are simply unfathomable. Their business strategy has zero chance of working long-term.

            • dotancohen 4 hours ago

              In writing code, from what I've seen, Anthropic's models are still the most widely used. I would venture that over 50% of vibe coded apps, garbage though they are, are written by Claude Code. And they capture the most market in real coding shops as well, from what I've seen.

          • taylorius 3 hours ago

            Just out of interest, why do you want to move? What's wrong with Claude and Anthropic in your view? (I use it, and it works really well.)

        • losvedir an hour ago

          Not sure how relevant it is, but I finally decided to dip my toes in last night and write my first agent. Despite paying for ChatGPT Pro, Claude Pro, etc, you still have to load up credits to use the API version of them. I started with Claude, but there was a bug on the add credit form and I couldn't submit (I'm guessing they didn't test on MacOS Safari, maybe?). So I gave up and moved on to OpenAI's developer thing.

          Maybe they should do less vibe coding on their checkout flow and they might have more users.

        • biorach 4 hours ago

          > Their valuation is in the same order of magnitude as OpenAI, but they have orders of magnitude fewer users.

          It's an open question how many of OpenAI's users are monetizable.

          There's an argument to be made that your brand being what the general public identifies with AI is a medium-term liability in light of the vast capital and operating costs involved.

          It may well be that Anthropic focusing on an order of magnitude smaller, but immediately monetizable, market will play out better.

        • andy99 5 hours ago

          I’d argue OpenAI has put their cards on the table and they don’t have anything special, while Anthropic has not.

          Their valuations come from completely different calculus: Anthropic looks much more like a high potential early startup still going after PMF while OpenAI looks more like a series B flailing to monetize.

          The cutting edge has largely moved past benchmarks; beyond a certain performance threshold that all these models have reached, nobody really cares about scores anymore, except people overfitting to them. They’re going for models that users like better, and Claude has a very loyal following.

          TLDR, OpenAI has already peaked, Anthropic hasn’t, thus the valuation difference.

      • beng-nl 8 hours ago

        Tethics, Denpok.

    • robrenaud 11 hours ago

      They are the LLM whisperers.

      In the same way Nagel knew what it was like to be a bat, Anthropic has the highest fraction of people who approximately know what it's like to be a frontier ai model.

      • gabaix 8 hours ago

        Nagel's point is that he could not know what it was like to be a bat.

      • uoaei 11 hours ago

        It's surprising to me that Anthropic's CEO is the only one getting real recognition for their advances. The people around him seem to be as or more crucial for their mission.

        • ACCount37 5 hours ago

          Is that really true?

          I can name OpenAI CEO but not Anthropic CEO off the top of my head. And I actually like Anthropic's work way more than what OpenAI is doing right now.

          • uoaei 26 minutes ago

            Pick up the newest edition of Time.

        • blueblisters 10 hours ago

          Amanda Askell, Sholto Douglas have somewhat of a fan following on twitter

        • adastra22 9 hours ago

          That’s always the case.

    • losvedir an hour ago

      MCP is a terribly designed (and I assume vibe-designed) protocol. Give me the requirements that an LLM needs to be able to load tools dynamically from another server and invoke them like an RPC, and I could give you a much simpler, better solution.

      The modern HTTP Streamable version is light-years better, but took a year and was championed by outside engineers faced with the real problem of integrating it, and I imagine was designed by a human.

      OpenAI was there first, but the models weren't quite good enough yet, so their far superior approach unfortunately didn't take off.

    • mhalle 2 hours ago

      Skills are not just markdown files. They are markdown files combined with code and data, which only work universally when you have a general purpose cloud-based code execution environment.

      Out of the box, Claude skills can call Python scripts that load modules from PyPI or even GitHub, potentially ones that include data like SQLite files or Parquet tables.

      Not just in Claude Code. Anywhere, including the mobile app.

    • lacy_tinpot 10 hours ago

      Their name is Anthropic. Their entire schtick is a weird humanization of AIs.

      MCP/Tool use, Skills, and I'm sure others that I can't think of.

      This might be because of some core direction that is more coherent than other labs'.

      • JoshuaDavid 9 hours ago

        ... I am pretty sure that the name "Anthropic" is as in "principle" not as in "pertaining to human beings".

        • kaashif 7 hours ago

          The anthropic principle is named as such because it is "pertaining to human beings".

          This is like saying McDonald's is named after the McDonald's happy meal rather than the McDonald brothers.

        • GlitchInstitute 8 hours ago

          anthropic is derived from the Greek word anthropos (human)

          https://en.wikipedia.org/wiki/Anthropic_principle

        • yunohn 7 hours ago

          Really? Anthropic is /the/ AI company known for anthropomorphizing their models, giving them ethics and “souls”, considering their existential crises, etc.

          • JoshuaDavid 6 hours ago

            Anthropic was founded by a group of 7 former OpenAI employees who left over differences in opinions about AI Safety. I do not see any public documentation that the specific difference in opinion was that that group thought that OpenAI was too focused on scaling and that there needed to be a purely safety-focused org that still scaled, though that is my impression based on conversations I've had.

            But regardless anthropic reasoning was extremely in the intellectual water supply of the Anthropic founders, and they explicitly were not aiming at producing a human-like model.

    • smokel 8 hours ago

      Also, MCP is a serious security disaster. Too simple, I'd wager.

      • valzam 5 hours ago

        I'd argue that this isn't so much a fault of the MCP spec as of the fact that 95% of AI 'engineers' have no engineering background. MCP is just an OpenAPI spec. It's the same as any other API. If you are exposing sensitive data without any authz/n, that's on the developer.

      • sam_lowry_ 8 hours ago

        complex is a synonym of insecure

      • brazukadev 6 hours ago

        MCP's biggest problem is not being simple

    • nl 9 hours ago

      Also `CLAUDE.md` (which is `AGENTS.md` everywhere else now?)

    • speakspokespok 8 hours ago

      I noticed something like this earlier: in the Android app you can have it rewrite a paragraph, and then and only then do you have the option to send that as a text message. It's just a button that pops up. Claude has an elegance to it.

    • blitzar 6 hours ago

      Anthropic are using AI beyond the chat window. Without external information, context and tools, the "magic" of AI evaporates after a few minutes.

    • CuriouslyC an hour ago

      Anthropic has good marketing, but ironically their well-marketed mediocre ideas retard the development of better standards.

    • nrhrjrjrjtntbt 12 hours ago

      The RSS of AI

      • uoaei 11 hours ago

        I like this line of analogy. The next obvious step would be IRC (or microservices?) of AI (for co-reasoning) which could offer the space for specialized LLMs rather than the current approach of monoliths.

        • jbgt 8 hours ago

          Oh wow, co-reasoning through an IRC-like chat. That's a great idea.

          Would be cool (sci fi) for LLMs of different users to chat and discuss approaches to what the humans are talking about etc.

          • exe34 7 hours ago

            omg that's how Crystal Society starts and then it goes downhill! Highly recommended series in this space.

    • msy 8 hours ago

      I get the impression the innovation drivers at OpenAI have all moved on and the people that have moved in were the ones chasing the money, the rest is history.

    • baxtr 9 hours ago

      A good example of:

      Build things and then talk about them in a way that people remember and share it with friends.

      I guess some call it clever product marketing.

    • extr 11 hours ago

      Oh yeah I forgot the biggest one. Claude fucking code. Lol

      • baby 11 hours ago

        I was very skeptical about anything not OpenAI for a while, and then discovered Claude code, Anthropic blogposts, etc. It's basically the coolest company in the field.

        • mh- 10 hours ago

          Claude Code and its ecosystem is what made me pick Anthropic over OpenAI for our engineers, when we decided to do seat licensing for everyone last week.

          It's a huge asset.

        • joemazerino 34 minutes ago

          I appreciate Claude not training on my data by default. ChatGPT through the browser does not give you that option.

        • skeptic_ai 10 hours ago

          Same here. Until I read more about them and they actually seem sketchy too. All about “safety” reasons not to do certain things.

    • _pdp_ 6 hours ago

      I hate to be that guy, but skills are not an invention as such. It's a simple mechanism that already exists in many places.

      The biggest unlock was tool calling, which was invented at OpenAI.

    • ivape 7 hours ago

      It’s the only AI company that isn’t monetizing at all costs. I’m curious how deep their culture goes, as it’s remarkable they even have any discernible value system in today’s business world.

  • energy123 9 hours ago

    A public warning about OpenAI's Plus chat subscription as of today.

    They advertise 196k tokens context length[1], but you can't submit more than ~50k tokens in one prompt. If you do, the prompt goes through, but they chop off the right-hand-side of your prompt (something like _tokens[:50000]) before calling the model.

    This is the same "bug" that existed 4 months ago with GPT-5.0 which they "fixed" only after some high-profile Twitter influencers made noise about it. I haven't been a subscriber for a while, but I re-subscribed recently and discovered that the "bug" is back.

    Anyone with a Plus sub can replicate this by generating > 50k tokens of noise then asking it "what is 1+1?". It won't answer.

    [1] https://help.openai.com/en/articles/11909943-gpt-52-in-chatg...

    • hu3 3 hours ago

      Well this explains the weird behaviour of GPT-5 often ignoring a large part of my prompt when I attached many code/CSV files despite keeping the total token count under control. That is with GitHub Copilot inside VSCode.

      The fix was to just switch to Claude 3.5 and now to 4.5 in VSCode.

    • wrcwill 4 hours ago

      ugh this is so amateurish. i swear since the release of o3 this has been happening on and off.

    • scrollop 9 hours ago

      And the Xhigh version is only available via API, not chatgpt.

      • noname120 8 hours ago

        Are you sure that the “extended thinking” option from the ChatGPT web client is something different?

    • ismailmaj 2 hours ago

      "Oh sorry guys, we made the mistake again that saves us X% in compute cost, we will fix it soon!"

  • simonw 16 hours ago

    I had a bunch of fun writing about this one, mainly because it was a great excuse to highlight the excellent news about Kākāpō breeding season this year.

    (I'm not just about pelicans.)

    • quinncom an hour ago

      Awww. If there weren’t only 237 of them, I would want to bring one of them home.

      > Kākāpō can be up to 64 cm (25 in) long. They have a combination of unique traits among parrots: finely blotched yellow-green plumage, a distinct facial disc, owl-style forward-facing eyes with surrounding discs of specially-textured feathers, a large grey beak, short legs, large blue feet, relatively short wings and a short tail. It is the world's only flightless parrot, the world's heaviest parrot, and also is nocturnal, herbivorous, visibly sexually dimorphic in body size, has a low basal metabolic rate, and does not have male parental care. It is the only parrot to have a polygynous lek breeding system. It is also possibly one of the world's longest-living birds, with a reported lifespan of up to 100 years.

      https://en.wikipedia.org/wiki/K%C4%81k%C4%81p%C5%8D

    • ajcp 15 hours ago

      And so the Kākāpō Benchmark was born

    • KK7NIL 16 hours ago

      TIL about a large moss green flightless parrot :)

      • uoaei 11 hours ago

        I'm impressed you have never encountered :partyparrot: in your work Slack.

      • mkl 6 hours ago

        They're also nocturnal!

    • jb_rad 16 hours ago

      Will Kākāpō be riding bicycles soon?

    • bilekas 15 hours ago

      > Skills are a keeper

      Good thinking, I agree actually, however..

      > Skills are based on a very light specification, if you could even call it that, but I still think it would be good for these to be formally documented somewhere.

      Like a lot of posts around AI, and I hope OP can speak to it: surely you can agree that while it can be used for a good, cool idea, it can also be used for the inverse, and probably to more detrimental ends. Why would they document an unmanageable feature that may be consumed that way?

      Shareholder value might not go up if they learnt that the major product is learning bad things.

      Have you or would you try this on a local LLM instead ?

      • simonw 15 hours ago

        These work well with local LLMs that are powerful enough to run a coding agent environment with a decent amount of context over longer loops.

        The OpenAI GPT OSS models can drive Codex CLI, so they should be able to do this.

        I have high hopes for Mistral's Devstral 2 but I've not run that locally yet.

        • bilekas 15 hours ago

          > These work well with local LLMs that are powerful enough to run a coding agent environment with a decent amount of context over longer loops.

          That's actually super interesting, maybe something I'll try to investigate to find the minimum requirements, because as cool as they seem, personalized 'skills' might be a more useful use of AI overall.

          Nice article, and thanks for answering.

          Edit: My thinking is consumer grade could be good enough to run this soon.

        • ipaddr 14 hours ago

          Something that powerful requires some rewiring of the house.

          Local LLMs are better for long batch jobs, not things you want immediately, or your flow gets killed.

  • lacker 16 hours ago

    I'm not sure if I have the right mental model for a "skill". It's basically a context-management tool? Like a skill is a brief description of something, and if the model decides it wants the skill based on that description, then it pulls in the rest of whatever amorphous stuff the skill has, scripts, documents, what have you. Is this the right way to think about it?

    • simonw 16 hours ago

      It's a folder with a markdown file in it plus optional additional reference files and executable scripts.

      The clever part is that the markdown file has a section in it like this: https://github.com/datasette/skill/blob/a63d8a2ddac9db8225ee...

        ---
        name: datasette-plugins
        description: "Writing Datasette plugins using Python and the pluggy plugin system. Use when Claude needs to: (1) Create a new Datasette plugin, (2) Implement plugin hooks like prepare_connection, register_routes, render_cell, etc., (3) Add custom SQL functions, (4) Create custom output renderers, (5) Add authentication or permissions logic, (6) Extend Datasette's UI with menus, actions, or templates, (7) Package a plugin for distribution on PyPI"
        ---
      
      On startup Claude Code / Codex CLI etc scan all available skills folders and extract just those descriptions into the context. Then, if you ask them to do something that's covered by a skill, they read the rest of that markdown file on demand before going ahead with the task.
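
      A rough sketch of that startup scan, assuming the frontmatter layout above (this is not the actual Claude Code / Codex CLI implementation):

        # Rough sketch of the startup scan described above, assuming well-formed
        # YAML frontmatter; not the real Claude Code / Codex CLI code.
        from pathlib import Path

        def skill_index(root: Path) -> str:
            lines = []
            for skill_md in sorted(root.glob("*/SKILL.md")):
                _, frontmatter, _ = skill_md.read_text().split("---", 2)
                meta = {}
                for line in frontmatter.strip().splitlines():
                    key, _, value = line.partition(":")
                    meta[key.strip()] = value.strip().strip('"')
                name = meta.get("name", skill_md.parent.name)
                lines.append(f"- {name}: {meta.get('description', '')}")
            return "\n".join(lines)  # this summary is what lands in the context
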
      • spike021 11 hours ago

        Apologies for not reading all of your blogs on this, but a follow-up question. Are models still prone to reading these and disregarding them even if they should be used for a task?

        Reason I ask is because a while back I had similar sections in my CLAUDE.md and it would either acknowledge and not use or just ignore them sometimes. I'm assuming that's more of an issue of too much context and now skill-level files like this will reduce that effect?

        • jrecyclebin 11 hours ago

          Skill descriptions get dumped in your system prompt - just like MCP tool definitions and agent descriptions before them. The more you have, the more the LLM will be unable to focus on any one piece of it. You don't want a bunch of irrelevant junk in there every time you prompt it.

          Skills are nice because they offload all the detailed prompts to files that the LLM can ask for. It's getting even better with Anthropic's recent switchboard operator (tool search tool) that doesn't clutter the system prompt but tries to cut the tool list down to those the LLM will need.

          • ithkuil 6 hours ago

            Can I organize skills hierarchically? If, when many skills are defined, Claude Code loads all definitions into the prompt, potentially diluting its ability to identify relevant skills, I'd like a system where only broad skill-group summaries load initially, with detailed descriptions loaded on demand when Claude detects a matching skill group might be useful.

            • simonw 3 hours ago

              There's a mechanism for that built into skills already: a skill folder can also include additional reference markdown files, and the skill can tell the coding agent to selectively read those extra files only when that information is needed on top of the skill.

              There's an instruction about that in the Codex CLI skills prompt: https://simonwillison.net/2025/Dec/13/openai-codex-cli/

                If SKILL.md points to extra folders such as references/, load only the specific files needed for the request; don't bulk-load everything.

          • greymalik 3 hours ago

            > Anthropic's recent switchboard operator

            I don’t know what this is and Google isn’t finding anything. Can you clarify?

      • behnamoh 15 hours ago

        Why did this simple idea take so long to become available? I remember even in Llama 2 days I was doing this stuff, and that model couldn't even function call.

        • simonw 15 hours ago

          Skills only work if you have a full blown code execution environment with a model that can run ls and cat and execute scripts and suchlike.

          The models are really good at driving those environments now which makes skills the right idea at the right time.

          • jstummbillig 8 hours ago

            Why do you need code execution envs? Could the skill not just be a function over a business process, do a then b then c?

            • steilpass 6 hours ago

              Turns out that basic shell commands are really powerful for context management. And you get tools which run in shells for free.

              But yes. Other agent platforms will adopt this pattern.

              • true2octave 4 hours ago

                I prefer to provide CLIs to my agent

                I find it powerful how it can leverage and self-discover the best way to use a CLI and its parameters to achieve its goals

                It feels more powerful than providing a pre-defined set of functions via MCP, which has less flexibility than a CLI

        • NiloCK 9 hours ago

          I still don't really understand `skills` as ... anything? You said yourself that you've been doing this since llama 2 days - what do you mean by "become available"?

          It is useful in a user-education sense to communicate that it's good to actively document useful procedures like this, and it is likely a performance / utilization boost that the models are tuned or prompt-steered toward discovering this stuff in a conventional location.

          But honestly reading about skills mostly feels like reading:

          > # LLM provider has adopted a new paradigm: prompts

          > What's a prompt?

          > You tell the LLM what you'd like to do, and it tries to do it. OR, you could ask the LLM a question and it will answer to the best of its ability.

          Obviously I'm missing something.

          • baq 7 hours ago

            It’s so simple there isn’t really more to understand. There’s a markdown doc with a summary/abstract section and a full manual section. Summary is always added to the context so the model is aware that there’s something potentially useful stored here and can look up details when it decides the moment is right. IOW it’s a context length management tool which every advanced LLM user had a version of (mine was prompt pieces for special occasions in Apple notes.)

      • kswzzl 14 hours ago

        > On startup Claude Code / Codex CLI etc scan all available skills folders and extract just those descriptions into the context. Then, if you ask them to do something that's covered by a skill, they read the rest of that markdown file on demand before going ahead with the task.

        Maybe I still don't understand the mechanics - this happens "on startup", every time a new conversation starts? Models go through the trouble of doing ls/cat/extraction of descriptions to bring into context? If so it's happening lightning fast and I somehow don't notice.

        Why not just include those descriptions within some level of system prompt?

        • simonw 14 hours ago

          Yes, it happens on startup of a fresh Claude Code / Codex CLI session. They effectively get pasted into the system prompt.

          Reading a few dozen files takes on the order of a few ms. They add enough tokens per skill to fit the metadata description, so probably less than 100 for each skill.

          • raybb 12 hours ago

            So when it says:

            > The body can contain any Markdown; it is not injected into context.

            It just means it's not injected into the context until the skill is used or it's never injected into the context?

            https://github.com/openai/codex/blob/main/docs/skills.md

            • simonw 12 hours ago

              Yeah, that means that the body of that file will not be injected into the context on startup.

              I had thought that once the skill is selected the whole file would be read, but it looks like that's not the case: https://github.com/openai/codex/blob/ad7b9d63c326d5c92049abd...

                1) After deciding to use a skill, open its `SKILL.md`. Read only enough to follow the workflow.
              
              So you could have a skill file that's thousands of lines long but if the first part of the file provides an outline Codex may stop reading at that point. Maybe you could have a skill that says "see migrations section further down if you need to alter the database table schema" or similar.

              • debugnik 7 hours ago

                Can models actually stream the file in as they see fit, or is "read only enough" just an attention trick? I suspect the latter.

                • true2octave 4 hours ago

                  Depends on the agent; they can read in chunks (e.g. 500 lines at a time)

              • wahnfrieden 11 hours ago

                Knowing Codex, I wonder if it might just search for text in the skill file and read around matches, instead of always reading a bit from the top first.

      • kridsdale1 11 hours ago

        So it’s a header file. In English.

      • leetrout 15 hours ago

        Have you used AWS Bedrock? I assume these get pretty affordable with prompt caching...

      • throwaway314155 15 hours ago

        Do skills get access to the current context or are they a blank slate?

        • simonw 15 hours ago

          They execute within the current context - it's more that the content of the skill gets added to that context when it is needed.

    • prescriptivist 15 hours ago

      Skills have a lot of uses, but one in particular I like is replacing one-off MCP server usage. You can use (or write) an MCP server for your CI system and then add the instructions to your AGENTS.md to query the CI MCP for build results for the current branch. Then you need to find a way to distribute the MCP server so the rest of the team can use it, or cook it into your dev environment setup. But all you really care about is one tool in the MCP server, the build result. Or...

      You can hack together a shell, Python, whatever script that fetches build results from your CI server, dumps them to stdout in a semi-structured format like markdown, then add a 10-15 line SKILL.md and you have the same functionality -- the skill just executes the one-off script and reads the output. You package the skill with the script, usually in a directory in the project you are working on, but you can also distribute them as plugins (bundles) that Claude Code can install from a "repository", which can just be a private git repo.
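
      As a concrete but made-up example, the whole thing can be one script plus a SKILL.md that says "run scripts/ci_status.py <branch> and read the output". The CI URL, token env var and JSON shape below are assumptions, not any specific CI system's real API:

        #!/usr/bin/env python3
        # scripts/ci_status.py - hypothetical example of a skill's helper script.
        import json, os, sys, urllib.request

        branch = sys.argv[1] if len(sys.argv) > 1 else "main"
        url = f"https://ci.example.com/api/builds?branch={branch}"
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {os.environ['CI_TOKEN']}"})
        builds = json.load(urllib.request.urlopen(req))

        print(f"# CI results for `{branch}`\n")
        for b in builds[:5]:
            print(f"- {b['job']}: **{b['status']}** ({b['url']})")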

      It's a little UNIX-y in a way: little tools that pipe output to another tool, and they are useful in a standalone context or in a chain of tools. Whereas MCP is a full-blown RPC environment (that has its uses, where appropriate).

      • wiether 10 hours ago

        How do you manage the credentials for requests to your CI server in this case? Are they hardcoded in the script associated with your SKILL?

        • true2octave 4 hours ago

          Credentials are tied to the service principal of the user

          It’s straightforward for cloud services

    • delaminator 8 hours ago

      Claude Code is not very good at “remembering” its skills.

      Maybe they get compacted out of the context.

      But you can call upon them manually. I often do something like “using your Image Manipulation skill, make the icons from image.png”

      Or “use your web design skill to create a design for the front end”

      Tbh i do like that.

      I also get Claude to write its own skills. “Using what we learned from this task, write a skill document called /whatever/ using your writing-skills skill”

      I have a GitHub template including my skills and commands, if you want to see them.

      https://github.com/lawless-m/claude-skills

      • jorl17 2 hours ago

        I'm so excited for the future, because _clearly_ our technology has loads to improve. Even if new models don't come out, the tooling we build upon them, and the way we use them, is sure to improve.

        One particular way I can imagine this is with some sort of "multipass makeshift attention system" built on top of the mechanisms we have today. I think for sure we can store the available skills in one place and look only at the last part of the query, asking the model the question: "Given this small, self-contained bit of the conversation, do you think any of these skills is a prime candidate to be used?" or "Do you need a little bit more context to make that decision?". We then pass along that model's final answer as a suggestion to the actual model creating the answer. There is a delicate balance between "leading the model on" with imperfect information (because we cut the context), and actually "focusing it" on the task at hand and the skill selection. Well, and, of course, there's the issue of time and cost.

        I actually believe we will see several solutions make use of techniques such as this, where some model determines what the "big context" model should be focusing on as part of its larger context (in which it may get lost).

        In many ways, this is similar to what modern agents already do. Cursor doesn't keep files in the context: it constantly re-reads only the parts it believes are important. But I think it might be useful to keep the files in the context (so we don't make an egregious mistake) at the same time that we also find what parts of the context are more important and re-feed them to the model or highlight them somehow.

      • Sammi 6 hours ago

        I'm kinda confused about why this even is something that we need an extra feature for when it's basically already built into the agentic development feature. I just keep a folder of md files and I add whichever one is relevant when it's relevant. It's kinda straightforward to do...

        Just like you I don't edit much in these files on my own. Mostly just ask the model to update an md file whenever I think we've figured out something new, so the learning sticks. I have files for test writing, backend route writing, db migration writing, frontend component writing etc. Whenever a section gets too big to live in agents.md it gets its own file.

        • jorl17 2 hours ago

          Because the concept of skills is not tied to code development :) Of course if that's what you're talking about, you are already very close to the "interface" that skills are presented in, and they are obvious (and perhaps not so useful)

          But think of your dad or grandma using a generic agent, and simply selecting that they want to have certain skills available to it. Don't even think of it as a chat interface. This is just some option that they set in their phone assistant app. Or, rather, it may be that they actually selected "Determine the best skills based on context", and the assistant has "skill packs" which it periodically determines it needs to enable based on key moments in the conversation or latest interactions.

          These are all workarounds for the problems of learning, memory...and, ultimately, limited context. But they for sure will be extremely useful.

    • marwamc 12 hours ago

      My understanding is this: A skill is made up of SKILL.md which is what tells claude how and when to use this skill. I'm a bit of a control freak so I'll usually explicitly direct claude to "load the wireframe-skill" and then do X.

      Now SKILL.md can have references to more finegrained behaviors or capabilities of our skill. My skills generally tend to have a reference/{workflows,tools,standards,testing-guide,routing,api-integration}.md. These references are what then gets "progressively loaded" into the context.

      Say I asked claude to use the wireframe-skill to create profileView mockup. While creating the wireframe, claude will need to figure out what API endpoints are available/relevant for the profileView and the response types etc. It's at this point that claude reads the references/api-integration.md file from the wireframe skill.

      After a while I found I didn't like the progressive loading so I usually direct claude to load all references in the skill before proceeding - this usually takes up maybe 20k to 30k tokens, but the accuracy and precision (imagined or otherwise ha!) is worth it for my use cases.

      • kxrm 11 hours ago

        > I'm a bit of a control freak so I'll usually explicitly direct claude to "load the wireframe-skill" and then do X.

        You shouldn't do this, it's generally considered bad practice.

          You should be optimizing your skill description. Oftentimes if I am working with Claude Code and it doesn't load a skill, I ask it why it missed the skill. It will guide me toward improving the skill description so that it is picked up properly next time.

          This iteration on skill descriptions has allowed skills to stay out of context until they are needed, and it has worked rather predictably for me so far.

        • adastra22 8 hours ago

            There are different ways to use the tool. If you chat with the model, you want it to naturally pick the right tool to use based on vibes and context so you don’t have to repeat yourself. If you are plugging a call to Claude Code into a larger, structured workflow, you want the tool selection to be deterministic.

        • rane 4 hours ago

          It's not enough. Sometimes skills just randomly won't be invoked.

      • chrisweekly 3 hours ago

        My understanding is that use of "description" frontmatter is essential, because Claude Code can read just the description without loading the entire file into context.

    • jmalicki 16 hours ago

      Yes. I find these very useful for enforcing skills like debugging, committing code, making PRs, responding to PR feedback from AI review agents, etc. without constantly polluting the context window.

      So when it's time to commit, make sure you run these checks, write a good commit message, etc.

      Debugging is especially useful since AI agents can often go off the rails and go into loops rewriting code - so in a skill I can push for "read the log messages. Insert some more useful debug assertions to isolate the failure. Write some more unit tests that are more specific." Etc.

    • taytus 7 hours ago

      Easy, let me try to explain: You want to achieve X, so you ask your AI companion, "How do I do X?" Your companion thinks and tries a couple of things, and they eventually work. So you say, "You know what, next time, instead of figuring it out, just do this"... that is a skill. A recipe for how to do things.

    • canadiantim 16 hours ago

      I think it’s also important to think of skills in the context of tasks, so when you want an agent to perform a specialized task, then this is the context, the resources and scripts it needs to perform the task.

      • hadlock 9 hours ago

        I'm excited to use this with the Ghidra CLI mode to rapidly decompile physics engines from various games. Do I want my flight simulator to behave like the Cessna in Flight Simulator 3.0 in the air? Codex can already do that. Do I want the plane to handle like Yoshi from Mario Kart 64 when taxiing? It hasn't been done yet, but Claude Code is apparently pretty good at pulling apart N64 ROMs, so that seems within the realm of possibility.

  • mbesto 16 hours ago

    From a purely technical view, skills are just an automated way to introduce user and system prompt stuffing into the context right? Not to belittle this, but rather that seems like a way of reducing the need for AI wrapper apps since most AI wrappers just do systematic user and system prompt stuffing + potentially RAG + potentially MCP.

    • simonw 16 hours ago

      Yeah, there are a whole lot of AI wrapper applications that could be a folder with a markdown file in it at this point!

  • structuredPizza an hour ago

    tbh everything about the current implementation of "AI" is starting to look like hot porridge when it comes to real world products.

    Is the prompting workflow so convenient that it’s worth having to spend twice or thrice as much time double checking the accuracy of the inference and fixing bugs?

    How long until we collectively decide that to reduce the probability of errors we’re better off going back to writing our own functions, methods, classes etc. because it gives us granular control?

    Last but not least, we’re devolving to mainframe and terminals…

  • ctoth 13 hours ago

    @simonw Thank you for always setting alt text in your images. I really appreciate it.

    • GaggiX 4 hours ago

      When there is no alt text do you have like a solution for that? Like VLMs are really powerful, I imagine they can be used to parse through the unlabeled images automatically if needed.

  • mkagenius 12 hours ago

    If anyone wants to use skills with any other model or tool, like Gemini CLI etc., I created open-skills, which lets you use skills with any other LLM.

    Caveat: needs a Mac to run

    Bonus: it runs locally in a container, not in the cloud nor directly on the Mac

    1. Open-Skills: https://GitHub.com/BandarLabs/open-skills

  • dsiegel2275 4 hours ago

    It is great that Codex CLI is adding skills - but it would be far more useful if the CLI looked first in the project's `.codex/skills` directory (the directory where I've launched codex) and THEN the home directory's .codex dir. The same issue exists for prompts.

  • rokoss21 4 hours ago

    This is a clever abstraction. Reminds me of how tool_use worked in earlier Claude versions - defining a schema upfront and letting the model decide when to call it.

    Has anyone tested how well this works with code generation in Codex CLI specifically? The latency on skill registration could matter in a typical dev workflow.

  • swyx 14 hours ago

    we just released Anthropic's Skills talk for those who want to find more info on the design thinking / capabilities: https://www.youtube.com/watch?v=CEvIs9y1uog&t=2s

  • jumploops 16 hours ago

    I think the future is likely one that mixes the kitchen-sink style MCP resources with custom skills.

    Services can provide an MCP-like layer that provides semantic definitions of everything you can do with said service (API + docs).

    Skills can then be built that combine some subset of the 3rd party interfaces, some bespoke code, etc. and then surface these more context-focused skills to the LLM/agent.

    Couldn’t we just use APIs?

    Yes, but not every API is documented in the same way. An “MCP-like” registry might be the right abstraction for 3rd parties to expose their services in a semantic-first way.

    • prescriptivist 14 hours ago

      Agree. I'd add that an aha moment with skills is that AI agents are pretty good at writing skills. Let's say you have developed an involved prompt that explains how to hit an API (possibly with the complexity of reading credentials from an env var or config file) or run a tool locally to get some output you want the agent to analyze (for example, downloading two versions of Python packages and diffing them to analyze changes). Usually the agent reading the prompt is going to leverage local tools to do it (curl, shell + stdout, git, whatever) every single time. Every time you execute that prompt there is a lot of thinking spent on deciding to run these commands and you are burning tokens (and time!). As an eng you know that this is a relatively consistent and deterministic process to fetch the data. And if you were consuming it yourself, you'd write a script to automate it.

      So you read about skills (prompt + scripts) to make this more repeatable and reduce time spent thinking. At that point there are two paths you can go down -- write the skill and prompt yourself for the agent to execute -- or better -- just tell the agent to write the skill and prompt and then you lightly edit it and commit it.

      This may seem obvious to some, but I've seen engineers create skills from scratch because they have a mental model around skills being something that people must build for the agent, whereas IMO skills are you just bridging a productivity gap that the agent can't figure out itself (for now), which is instructing it to write tools to automate its own day to day tedium.

      • simonw 14 hours ago

        The example Datasette plugin authoring skill I used in my article was entirely written by Claude Opus 4.5 - I uploaded a zip file with the Datasette repo in it (after it failed to clone that itself for some weird environment reason) and had it use its skill-writing skill to create the rest: https://claude.ai/share/0a9b369b-f868-4065-91d1-fd646c5db3f4

        • prescriptivist 13 hours ago

          That's awesome and I have a few similar conversations with Claude. I wasn't quite an AI luddite a couple months ago, but close. I joined a new company recently that is all in on AI and I have a comically huge token budget so I jumped all the way in myself. I have my choice of tools I can use and once I tried Claude Code it all clicked. The topology they are creating for AI tooling and concepts is the best of all the big LLMs, by far. If they can figure out the remote/cloud agent piece with the level of thoughtfulness they have given to Code, it'd be amazing. Cursor Cloud has that area locked down right now, but I'm looking forward to how Anthropic approaches it.

    • dkdcio 16 hours ago

      CLIs are really good when you can use them: self-documenting, agents already have shell tools, they tend to solve fine-grained auth, etc.

      feels like the right layer of abstraction for remote APIs

    • esafak 15 hours ago

      If only there was a way to progressively disclose the API in MCP instead of presenting the full laundry list up front.

  • mehdibl 14 hours ago

    This is killing me with complexity. We had agents.md and were supposed to augment the context there. Now back to cursor rules and another md file to ingest.

    • rafaquintanilha 16 minutes ago

      Skills are just pointers to context so you don't need to load all of them upfront, it is as simple as that. By the way cursor rules is effectively the same as agents.md.

    • simonw 14 hours ago

      MCPs feel complicated. Skills seem to me like the simplest possible design for a mechanism for adding extra capabilities to an existing coding agent.

    • delaminator 8 hours ago

      I tell Claude to make its own skills. “Which part of this task is worth making a skill for, use your skill making skill to do it”

      • baq 7 hours ago

        If we aren’t in the take off phase, I don’t know where we are

  • Fannon 12 hours ago

    This is nice, but that it goes into its vendor-specific .codex/ folder is a bit of a drag.

    I hope such things will be standardized across vendors. Now that they founded the Agentic AI Foundation (AAIF) and also contributed AGENTS.md, I would hope that skills become a logical extension of that.

    https://www.linuxfoundation.org/press/linux-foundation-annou...

    https://aaif.io/

  • jstummbillig 9 hours ago

    Is there a fundamental difference between a skill and a tool or could I just make a terse skill and have that be used in the same way as a tool?

    • Imanari 6 hours ago

      I think a tool call can be thought of as a special type of reply where its contents are parsed and an actual function is called. A skill is more of a dynamic context enrichment.

  • brainless 11 hours ago

    The skills approach is great for agents and LLMs but I feel agents have to become wider in the context they keep and more proactive in the orchestration.

    I have been running Claude Code with simple prompts (eg 1) to orchestrate opencode when I do large refactors. I have also tried generating orchestration scripts instead. Like, generate a list of tasks at a high level. Have a script go task by task, create a small task level prompt (use a good model) and pass on the task to agent (with cheaper model). Keeping context low and focused has many benefits. You can use cheaper models for simple, small and well-scoped tasks.

    This brings me to skills. In my product, nocodo, I am building a heavier agent which will keep track of a project, past prompts, skills needed and use the right agents for the job. Agents are basically a mix of system prompt and tools. All selected on the fly. User does not even have to generate/maintain skills docs. I can get them generated and maintained with high quality models from existing code in the project or tasks at hand.

    1 Example prompt I recently used: Please read GitHub issue #9. We have phases clearly marked. Analyze the work and codebase. Use opencode, which is a coding agent installed. Check `opencode --help` about how to run a prompt in non-interactive mode. Pass each phase to opencode, one phase at a time. Add extra context you think is needed to get the work done. Wait for opencode to finish, then review the work for the phase. Do not work on the files directly, use opencode

    My product, nocodo: https://github.com/brainless/nocodo

  • bluedino 13 hours ago

    > It took just over eleven minutes to produce this PDF,

    Incredibly dumb question, but when they say this, what actually happens?

    Is it using TeX? Is it producing output using the PDF file spec? Is there some print driver it's wired into?

    • simonw 12 hours ago

      Visit this link and click on the "Thought for 11m38s" text: https://chatgpt.com/share/693ca54b-f770-8006-904b-9f31a58518... - that will show you exactly what it spent those 11 minutes doing, most of which was executing Python code using the reportlab library to generate PDF files, then visually inspecting those PDF files and deciding to make further tweaks to the code that generates them.

  • bzmrgonz 16 hours ago

    It is interesting that they are relying on visual reading for document ingestion instead of OCR. I recently read an article saying handwriting recognition has matured, and I'm beginning to think this visual approach is how they're handling handwriting recognition as well.

  • Pooge 9 hours ago

    Does anybody have examples of life-changing skills? I can't quite understand how they're useful, yet...

    • Adrig 3 hours ago

      I don't know about life-changing but to me there are two major benefits that get me really interested:

      - Augmenting CLI with specific knowledge and processes: I love the ability to work on my files, but I can only call a smart generalist to do the work. With skills if I want, say, a design review, I can write the process, what I'm looking for, and design principles I want to highlight rather than the average of every blog post about UX. I created custom gems/projects before (with PDFs of all my notes), but I couldn't replicate that on CLIs.

      - Great way to build your library of prompts and build on it: In my org everyone is experimenting with AI but it's hard to document and share good processes and tools. With this, the copywriters can work on a "tone of voice" skill, the UX writers can extend it with an "Interface microcopy" skill, and I can add both to my "design review" agent.

    • hadlock 9 hours ago

      Giving the LLM access to Ghidra so it can directly read and iterate through the Sudoku puzzle that is decompiled binaries seems like a good one. Ghidra has a CLI mode and various bindings, so you can automate decompiling binaries. For example, right now if you want to isolate the physics step of Microsoft Flight Simulator 3.0, Codex will hold your hand and walk you through (over the course of 3-4 hours, using the GUI) finding the main loop and making educated guesses about which decompiled C functions in there are likely physics related, but it would be a lot easier to just give it the "Ghidra" skill and say, "isolate the physics engine and export it as a portable Cargo package in Rust". If you're an NSA analyst you can probably use it to disassemble and isolate interesting behavior of binaries from state actors a lot faster.

      • noname120 8 hours ago

        Do you have experience using Ghidra in such a way? I’m curious how well it actually performs on that use case.

    • simonw 4 hours ago

      The best examples I've seen are still the ones built into ChatGPT and Claude to improve their abilities to edit documents.

      The Claude frontend-design skill seems pretty good too for getting better HTML+CSS: https://github.com/anthropics/skills/blob/main/skills/fronte...

    • sunaookami 8 hours ago

      I have made a skill that uses Playwright to control Chrome together with functionality to extract HTML, execute JS, click things and most importantly log full network requests. It's a blessing for reverse-engineering and making userscripts.
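
      For anyone curious, a stripped-down sketch of that pattern using Playwright's Python API might look like this (the URL and logging are illustrative only):

        from playwright.sync_api import sync_playwright

        with sync_playwright() as p:
            # channel="chrome" drives an installed Chrome rather than the bundled Chromium
            browser = p.chromium.launch(channel="chrome", headless=False)
            page = browser.new_page()

            # Log every request/response pair for later reverse-engineering
            page.on("request", lambda req: print("->", req.method, req.url))
            page.on("response", lambda res: print("<-", res.status, res.url))

            page.goto("https://example.com")
            html = page.content()                     # extract rendered HTML
            title = page.evaluate("document.title")   # execute arbitrary JS
            print(title, len(html))

            browser.close()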

    • Veen 8 hours ago

      Small use case but I’m using skills for analysing and scoring content then producing charts. LLM does the scoring then calls a Python script bundled in the skill that makes a variety of PNG charts based on metrics passed in via command line arguments. Claude presents the generated files for download. The skill.md file explains how to run the analysis and how to call the script and with what options. That way, you can get very consistent charts because they’re generated programmatically, but you can use the LLM for what it’s good at.
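
      A minimal version of that kind of bundled script might look like the sketch below; the metric names, scale and output path are invented, and the real SKILL.md would document whatever options the actual script takes:

        import argparse

        import matplotlib
        matplotlib.use("Agg")  # headless rendering inside the sandbox
        import matplotlib.pyplot as plt

        # The LLM does the scoring, then passes the numbers in via CLI arguments
        parser = argparse.ArgumentParser(description="Render consistent PNG charts from scores")
        parser.add_argument("--clarity", type=float, required=True)
        parser.add_argument("--accuracy", type=float, required=True)
        parser.add_argument("--tone", type=float, required=True)
        parser.add_argument("--out", default="scores.png")
        args = parser.parse_args()

        labels = ["Clarity", "Accuracy", "Tone"]
        values = [args.clarity, args.accuracy, args.tone]

        plt.figure(figsize=(6, 4))
        plt.bar(labels, values)
        plt.ylim(0, 10)
        plt.title("Content scores")
        plt.savefig(args.out, dpi=150)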

  • 8cvor6j844qw_d6 16 hours ago

    Does this mean I can point to a code snippet and a link to the related documentation and have the coding agent refer to it instead of writing "outdated" code?

    Some frameworks/languages move really fast unfortunately.

    • simonw 16 hours ago

      Yes, definitely. I've had a lot of success already showing LLMs short examples of coding libraries they don't know about from their core training data.

      • lexoj 7 hours ago

        In this new world order, frameworks need to stop changing their APIs for minimal marginal improvements in syntax.

  • robkop 13 hours ago

    Hasn’t ChatGPT been supporting skills with a different name for several months now through “agent”?

    Back then they gave it folders with instructions and executable files, iirc.

    • simonw 13 hours ago

      Not quite the same thing. Implementing skills specifically means that you have code which, on session start, scans the skills/*/skill.md files, reads in their description: frontmatter metadata and loads that into the system prompt, along with an instruction that says "if the user asks about any of these particular things go and read the skill.md file for further instructions".

      Here's the prompt within Codex CLI that does that: https://github.com/openai/codex/blob/ad7b9d63c326d5c92049abd...

      I extracted that into a Gist to make it easier to read: https://gist.github.com/simonw/25f2c3a9e350274bc2b76a79bc8ae...
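
      If it helps to see the shape of it, here's a rough, vendor-neutral sketch of that startup scan; the folder layout follows the convention described above, but the wording of the injected instruction is invented:

        from pathlib import Path

        def load_skill_descriptions(root="skills"):
            """Collect the description: frontmatter line from every skill folder."""
            entries = []
            for skill_md in sorted(Path(root).glob("*/SKILL.md")):
                for line in skill_md.read_text().splitlines():
                    if line.startswith("description:"):
                        description = line.split(":", 1)[1].strip()
                        entries.append(f"- {skill_md.parent.name}: {description} ({skill_md})")
                        break
            return entries

        def skills_system_prompt(root="skills"):
            # Injected into the system prompt at session start
            return (
                "You have the following skills available. If the user's request matches one, "
                "read that skill's SKILL.md for full instructions before proceeding:\n"
                + "\n".join(load_skill_descriptions(root))
            )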

      • robkop 13 hours ago

        I remember you did some reverse engineering when they released agent; does it not feel quite similar to you?

        I know they didn’t dynamically scan for new skill folders but they did have mentions of the existing folders (slides, docs, …) in the system prompt

        • simonw 12 hours ago

          The main similarity is that both of them take full advantage of the bash tool + file system combination.

  • Western0 7 hours ago

    Is it possible to use skills with offline models?

    • simonw 3 hours ago

      Yes, provided they are good enough at long chain tool calling to run a coding agent environment and have a decent context length.

  • taw1285 13 hours ago

    Curious if anyone has applied this "Skills" mindset to how you build the tool calls for your LLM agent applications?

    Say I have a CMS (I use a thin layer of Vercel AI SDK) and I want to let users interact with it via chat: tag a blog post, add an entry, etc. Should those actions be organized into discrete skill units like that? And how do we go about adding progressive discovery?

  • retinaros 2 hours ago

    It really bothers me that we are building abstractions in our language to hide the fact that most of these features are prompts hardcoded in a text file.

  • heliumtera 15 hours ago

    So chatgpt can read markdown files? I am very confused

    • simonw 15 hours ago

      ChatGPT has had a full Linux container system available to it for nearly three years now.

      OpenAI keep changing their mind on what to call it. I like the original name, "ChatGPT Code Interpreter", but they've also called it "advanced data analysis" at various points.

      Claude added the same feature in September this year: https://simonwillison.net/2025/Sep/9/claude-code-interpreter...

      In both ChatGPT and Claude you can say things like "use your Python tool to calculate total mortgage payments over a 30 year period for X and Y" and it will write and execute code to do so - but you can also upload files (including CSVs or even SQLite database files) into that container file system and have them write and execute python code to process those in different ways.
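
      As a concrete (entirely made-up) example, the mortgage question might result in it writing and running a few lines like these with its Python tool:

        # Standard amortization formula for principal P, monthly rate r, n payments
        principal = 500_000      # example loan amount
        annual_rate = 0.06       # example interest rate
        years = 30

        r = annual_rate / 12
        n = years * 12
        monthly = principal * r * (1 + r) ** n / ((1 + r) ** n - 1)
        total = monthly * n

        print(f"Monthly payment: {monthly:,.2f}")
        print(f"Total paid over {years} years: {total:,.2f}")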

      Skills are just folders full of markdown files that are saved in that container when it first boots up.

      • heliumtera an hour ago

        Oooooo, okay. So in fact it has the technical capability to utilize and take advantage of this information provided as skills. That is much clearer now; I appreciate your response very much.

  • ohghiZai 16 hours ago

    Is there a way to implement skills with Gemini?

    • simonw 16 hours ago

      Looks like they added it to the Gemini CLI public roadmap last week: https://github.com/google-gemini/gemini-cli/issues/11506#eve...

    • badlogic 15 hours ago

      Create a markdown file; for each SKILL.md of the skills you want to use, put its frontmatter in that single markdown file along with the full path to the SKILL.md file. On session start, tell Gemini to read that file. If you put it in your AGENTS.md, you don't have to instruct Gemini. And if you have your skills in a known folder, let Gemini write a small script that generates that markdown file for you.
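
      A small generator script along those lines (paths and file names are just examples) could be:

        from pathlib import Path

        skills_dir = Path("skills")              # wherever your skills live
        index_file = Path("skills-index.md")     # the file you tell Gemini to read

        lines = ["# Available skills", ""]
        for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
            lines.append(f"## {skill_md.parent.name}")
            lines.append(f"Full path: {skill_md.resolve()}")
            text = skill_md.read_text()
            # Copy the frontmatter block (between the leading '---' markers) verbatim
            if text.startswith("---"):
                lines.append(text.split("---", 2)[1].strip())
            lines.append("")

        index_file.write_text("\n".join(lines))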

  • hurturue 16 hours ago

    Github Copilot too

  • lynx97 6 hours ago

    @simonw: Any particular reason you stopped releasing "llm" regularly? I believe the last release was in the summer. Neither gpt-5.1 nor gpt-5.2 have been added. Are you about to give up on that project? Is it time to search for another one?

    I've also had an open issue for months, which someone wrote a PR for (thanks!) a few weeks ago.

    Are you still committed to that project?

    • simonw 3 hours ago

      I shipped a new release yesterday - https://llm.datasette.io/en/stable/changelog.html#v0-28 but yeah, the last core release before that was in August. I've been pushing out plugin releases for it though, for new models from Gemini and Anthropic and others.

      Honestly the main problem has been that LLM's unique selling point back in 2024 was that it was the only tool taking CLI access to LLMs seriously. In 2025 Claude Code and Codex CLI etc all came along and suddenly there's not much unique about having a CLI tool for LLMs any more!

      There's also a major redesign needed to the database storage and model abstraction layer in order to handle reasoning traces and more complex tool call patterns. I opened an issue about that here - it's something I'm stewing on but will take quite some work to get right: https://github.com/simonw/llm/issues/1314

      I've been spending more of my time focusing on other projects which make use of LLM, in particular Datasette plugins that use the asyncio Python library: https://llm.datasette.io/en/stable/python-api.html#async-mod...

      I expect those to drive some core improvements pretty soon.

  • esperent 16 hours ago

    It seems to me that skills are:

    1. A top level agent/custom prompt

    2. Subagents that the main agent knows about via short descriptions

    3. Subagents have reference files

    4. Subagents have scripts

    Anthropic specific implementation:

    1. Skills are defined in a filesystem in a /skills folder with a specific subfolder structure of /references and /scripts.

    2. Mostly designed to be run via their CLI tool, although there's a clunky way of uploading them to the web interface via zip files.

    I don't think the folder structure is a necessary part of skills. I predict that if we stop looking at that, we'll see a lot of "skills-like" implementations. The scripting part is only useful for people who need to run scripts, which, aside from the now built-in document-manipulation scripts, isn't most people.

    For example, I've been testing out Gemini Enterprise for use by staff in various (non-technical) positions at my business.

    It's got the best implementation of a "skills-like" agent tool I've seen. Basically a visual tree builder, currently only one level deep. So I've set up the "<my company name> agent" and then it has subagents/skills for things like marketing/supply chain research/sysadmin/translation etc., each with a separate description, prompt, and knowledge base, although no custom scripts.

    Unfortunately, everything else about Gemini Enterprise screams "early alpha, why the hell are you selling this as an actual finished product?".

    For example, after I put half a day into setting up an agent and subagents, then went to share this with the other people helping me to test it, I found that... I can't. Literally no way to share agents in a tool that is supposedly for teams to use. I found one of the devs saying that sharing agents would be released in "about two weeks". That was two months ago.

    Mini rant over... But my point is that skills are just "agents + auto-selecting sub-agents via a short description" and we'll see this pattern everywhere soon. Claude Skills have some additional sandboxing but that's mostly only interesting for coders.

    • mhalle 13 hours ago

      I have found that scripts, and the environment that runs them, are the skills' superpower.

      Computability (scripts) means being able to build documents, access remote data, retrieve data from packaged databases and a bunch of other fundamentally useful things, not just "code things". Computability makes up for many of the LLM's weaknesses and gives it autonomy to perform tasks independently.

      On top of that, we can provide the documentation and examples in the skill that help the LLM execute computability effectively.

      And if the LLM gets hung up on something while executing the skill, we can ask it why and then have it write better documentation or examples for a new skill version. So skills can self-improve.

      It's still so early. We need better packaging, distribution, version control, sharing, composability.

      But there's definitely something simple, elegant, and effective here.

    • ohghiZai 15 hours ago

      Looking for a way to do this with ADK as well; it looks like skills can be a sweet spot between a giant instruction prompt and sprawling tools/subagents.

  • koakuma-chan 16 hours ago

    Does Cursor support skills?

    • smcleod 16 hours ago

      No I don't believe so. Cursor is usually pretty behind other agentic coding tools in my experience.

  • cubefox 10 hours ago

    (Minor grammar note: "OpenAI are" -- it should say "OpenAI is" -- because "OpenAI" is a name and therefore singular.)

    • simonw 9 hours ago

      Apparently this is a British vs American English thing. I've decided to stay stubbornly British on this one.

    • Esophagus4 2 hours ago

      Collective nouns!

      They vary between British and American English. In this case, either would be acceptable depending on your dialect.

      Also very noticeable with sports teams.

      American: “Team Spain is going to the final.”

      British: “Team Spain are going to the final.”

      https://editorsmanual.com/articles/collective-nouns-singular...

      • cubefox an hour ago

        I thought this was the same in all languages (my reference was German) because names are singular terms, even in British English, but apparently there are special rules.

  • nrhrjrjrjtntbt 12 hours ago

    Quietly? That is a clickbaity adverb.

    • simonw 12 hours ago

      Yes, but it's also true: OpenAI have said almost nothing in public about their support for skills (there's one tweet about the Codex CLI implementation https://x.com/thsottiaux/status/1995988758886580349 and that's it) while rolling out a pretty major feature - invented by their competitor - to their core 800m+ user product in the past 24 hours.

      I think "quietly" is fair.

      • nrhrjrjrjtntbt 12 hours ago

        Fair point. Other boys cried wolf and made me sceptical.

  • zx8080 13 hours ago

    Welcome to the world of imitation of value and semantics.

  • j45 14 hours ago

    Something important to keep in mind is that the way skills work shouldn't be assumed to be the same across different tools and vendors.

  • canadiantim 16 hours ago

    Can or should skills be used for managing the documentation of dependencies in a project and the expertise in them?

    I’ve been playing with doing this but it kind of doesn’t feel like the most natural fit.

  • sandspar 12 hours ago

    Totally unrelated but what’s up with the word “quietly”? Its usage seems to have gone up 5000%, essentially overnight, as if there’s a contagion. You see the word in the New York Times, in government press releases, in blogs. ChatGPT 5.1 itself used the word in almost every single response, and no amount of custom instructions could get it to stop. That “Google Maps of London restaurants” article that’s going around not only uses the word in the headline, but also twice in the closing passage alone, for example. And now Simon, who’s an excellent writer with an assertive style, has started using it in his headlines. What’s the deal? Why have so many excellent writers from a wide range of subjects suddenly all adopted the same verbal tic? Are these writers even aware that they’re doing it?

    • simonw 12 hours ago

      Huh! I had not noticed that trend at all.

      Here's the Google Maps article: https://laurenleek.substack.com/p/how-google-maps-quietly-al... - note that the Hacker News title left that word out: https://news.ycombinator.com/item?id=46203343

      It's possible I was subconsciously influenced by that article (I saw it linked from a few places yesterday I think), but in this case I really did want to emphasize that OpenAI have started doing this without making any announcements about it at all, which I think is noteworthy in its own right.

      (I'm also quite enjoying that this may be the second time I've leaked the existence of skills from a major provider - I wrote about Anthropic's skills implementation a week before they formally announced it: https://simonwillison.net/2025/Oct/10/claude-skills/)

      • sandspar 12 hours ago

        It’s definitely a useful word! Modern tech rollouts often do happen without fanfare. And the word alerts readers to a kind of story shape. So I can see why people use it! Its usage reminds me of when competitive video games develop a new meta, a new powerful technique. There follows a short period where everyone spams the technique over and over. Eventually people figure out a counter and the meta quietly disappears. (Couldn’t help myself!)

  • petetnt 16 hours ago

    It’s impressive how every iteration drifts further from pretending actual AGI is anywhere close, when we are basically writing library functions in the worst DSL known to man: markdown-with-English.

    • derac 15 hours ago

      Call me naive, but my read is the opposite. It's impressive to me that we have systems which can interpret plain english instructions with a progressively higher degree of reliability. Also, that such a simple mechanism for extending memory (if you believe it's an apt analogy) is possible. That seems closer to AGI to me, though maybe it is a stopgap to better generality/"intelligence" in the model.

      I'm not sure English is a bad way to outline what the system should do. It has tradeoffs. I'm not sure library functions are a 1:1 analogy either. Or if they are, you might grant me that it's possible to write a few english sentences that would expand into a massive amount of code.

      It's very difficult to measure progress on these models in a way that anyone can trust, moreso when you involve "agent" code around the model.

      • AdieuToLogic 14 hours ago

        > I'm not sure English is a bad way to outline what the system should do.

        It isn't, as these are how stakeholders convey needs to those charged with satisfying same (a.k.a. "requirements"). Where expectations become unrealistic is believing language models can somehow "understand" those outlines as if a human expert were doing so in order to produce an equivalent work product.

        Language models can produce nondeterministic results based on the statistical model derived from their training data set(s), with varying degrees of relevance as determined by persons interpreting the generated content.

        They do not understand "what the system should do."

        • veqq 13 hours ago

          > not sure English is a bad way to outline

          Human language is imprecise and allows unclear and logically contradictory things, besides not being checkable. That's literally why we have formal languages and programming languages, and why things like COBOL failed: https://alexalejandre.com/languages/end-of-programming-langs...

          • stinkbeetle 12 hours ago

            > Human language is imprecise and allows unclear and logically contradictory things,

            Most languages do.

            "x = true, x = false"

            What does that mean? It's unclear. It looks contradictory.

            Human language allows for clarification to be sought and adjustments made.

            > besides not being checkable.

            It's very checkable. I check claims and assertions people make all the time.

            > That's literally why we have formal languages,

            "Formal languages" are at some point specified and defined by human language.

            Human language can be as precise, clear, and logical as a speaker intends. All the way to specifying "formal" systems.

            > programming languages and things like COBOL failed: https://alexalejandre.com/languages/end-of-programming-langs...

            • DonHopkins 11 hours ago

                Let X=X.
                You know, it could be you.
                It's a sky-blue sky.
                Satellites are out tonight.
              
                Language is a virus! (mmm)
                Language is a virus!
                Aaah-ooh, ah-ahh-ooh
                Aaah-ooh, ah-ahh-ooh
        • idopmstuff 14 hours ago

          This is just semantics. You can say they don't understand, but I'm sitting here with Nano Banana Pro creating infographics, and it's doing as good of a job as my human designer does with the same kinds of instructions. Does it matter if that's understanding or not?

          • AdieuToLogic 13 hours ago

            > This is just semantics.

            Precisely my point:

              semantics: the branch of linguistics and logic concerned with meaning.
            
            > You can say they don't understand, but I'm sitting here with Nano Banana Pro creating infographics, and it's doing as good of a job as my human designer does with the same kinds of instructions. Does it matter if that's understanding or not?

            Understanding, when used in its unqualified form, implies people possessing same. As such, it is a metaphysical property unique to people and defined wholly therein.

            Excel "understands" well-formed spreadsheets by performing specified calculations. But who defines those spreadsheets? And who determines the result to be "right?"

            Nano Banana Pro "understands" instructions to generate images. But who defines those instructions? And who determines the result to be "right?"

            "They" do not understand.

            You do.

            • bonoboTP 13 hours ago

              "This is just semantics" is a set phrase in English and it means that the issue being discussed is merely about definitions of words, and not about the substance (the object level).

              And generally the point is that it does not matter whether we call what they do "understanding" or not. It will have the same kind of consequences in the end, economic and otherwise.

              This is basically the number one hangup that people have about AI systems, all the way back since Turing's time.

              The consequences will come from AI's ability to produce certain types of artifacts and perform certain types of transformations of bits. That's all we need for all the scifi stuff to happen. Turing realized this very quickly, and his famous Turing test is exactly about making this point. It's not an engineering kind of test. It's a thought experiment trying to prove that it does not matter whether it's just "simulated understanding". A simulated cake is useless, I can't eat it. But simulated understanding can have real world effects of the exact same sort as real understanding.

              • AdieuToLogic 12 hours ago

                > "This is just semantics" is a set phrase in English and it means that the issue being discussed is merely about definitions of words, and not about the substance (the object level).

                I understand the general use of the phrase and used same as an entryway to broach a deeper discussion regarding "understanding."

                > And generally the point is that it does not matter whether we call what they do "understanding" or not. It will have the same kind of consequences in the end, economic and otherwise.

                To me, when the stakes are significant enough to already see the economic impacts of this technology, it is important for people to know where understanding resides. It exists exclusively within oneself.

                > A simulated cake is useless, I can't eat it. But simulated understanding can have real world effects of the exact same sort as real understanding.

                I agree with you in part. Simulated understanding absolutely can have real world effects when it is presented and accepted as real understanding. When simulated understanding is known to be unrelated to real understanding and treated as such, its impact can be mitigated. To wit, few believe parrots understand the sounds they reproduce.

                • nick__m 12 hours ago

                  Your view on parrots is wrong! Parakeets don't understand, but some parrots are exceptionally intelligent.

                  African grey parrots do understand the words they use; they don't merely reproduce them. Once mature, they have the intelligence (and temperament) of a 4 to 6 year old child.

                  • AdieuToLogic 11 hours ago

                    > Your view on parrots is wrong !

                    There's a good chance of that.

                    > Africans grey parrots, do understand the words they use, they don't merely reproduce them. Once mature they have the intelligence (and temperament) of a 4 to 6 years old child.

                    I did not realize I could discuss with an African grey parrot the shared experience of how difficult it was to learn how to tie my shoelaces and what the feeling was like to go to a place every day (school) which was not my home.

                    I stand corrected.

            • dhoe 13 hours ago

              You can, of course, define understanding as a metaphysical property that only people have. If you then try to use that definition to determine whether a machine understands, you'll have a clear answer for yourself. The whole operation, however, does not lead to much understanding of anything.

              • AdieuToLogic 12 hours ago

                >> Understanding, when used in its unqualified form, implies people possessing same.

                > You can, of course, define understanding as a metaphysical property that only people have.

                This is not what I said.

                What I said was that unqualified use of "understanding" implies the understanding people possess. Thus it is a metaphysical property by definition, existing strictly within a person.

                Many other entities possess their own form of understanding. Most would agree mammals do. Some would say any living creature does.

                I would make the case that every program compiler (C, C#, C++, D, Java, Kotlin, Pascal, etc.) possesses understanding of a particular sort.

                All of the aforementioned examples differ from the kind of understanding people possess.

            • throw310822 8 hours ago

              > it is a metaphysical property unique to people

              So basically your thesis is also your assumption.

            • DonHopkins 11 hours ago

              The visual programming language for programming human and object behavior in The Sims is called "SimAntics".

              https://simstek.fandom.com/wiki/SimAntics

        • kjkjadksj 12 hours ago

          When do we jump the shark and replace the stakeholders with ai acting in their best interest (tm)? Seems that would come soon. It makes no sense to me that we’d obsolete engineering talent but then keep the people who got a 3.1 gpa in a business program around for reasons. Once we hit that point just dispense with english and have the models communicate to each other in binary. We can play with sticks in caves.

          • baq 6 hours ago

            That’s the thing people have in mind when they’re asking about your p(doom) and the leaders in the field have rather concerning priors on that.

            https://pauseai.info/pdoom

      • raincole 13 hours ago

        I 100% agree. I don't know what the GP is on. Being able to write instructions in a .md file is "further away from AGI"? Like... what? It's just a little quality of life feature. How and why is it related to AGI?

        Top HN comments sometimes read like a random generator:

        return random_criticism_of_ai_companies() + " " + unrelated_trivia_fact()

        Why are people treating everything OpenAI does as evidence against AGI? It's like saying if you don't mortgage your house to go all-in on AAPL, you "don't really believe Apple has a future." Even if OpenAI does believe there is an X% chance AGI will be achieved, it doesn't mean they should stop literally everything else they're doing.

      • adastra22 14 hours ago

        I’ve posted this before, but here goes: we achieved AGI in either 2017 or 2022 (take your pick) with the transformer architecture and the achievement of scaled-up NLP in ChatGPT.

        What is AGI? Artificial. General. Intelligence. Applying domain independent intelligence to solve problems expressed in fully general natural language.

        It’s more than a pedantic point though. What people expect from AGI is the transformative capabilities that emerge from removing the human from the ideation-creation loop. How do you do that? By systematizing the knowledge work process and providing deterministic structure to agentic processes.

        Which is exactly what these developments are doing.

        • aaronblohowiak 14 hours ago

          We have achieved AGI no more than we have achieved human flight.

          • adastra22 12 hours ago

            Yes, I agree! Thank you for that apt comparison.

          • kelchm 13 hours ago

            Are you really making the argument that human flight hasn’t been effectively achieved at this point?

            I actually kind of love this comparison — it demonstrates the point that just like “human flight”, “true AGI” isn’t a single point in time, it’s a many-decade (multi-century?) process of refinement and evolution.

            Scholars a millennium from now will be debating about when each of these was actually “truly” achieved.

            • mbreese 13 hours ago

              I’ve never heard it described this way: AGI as similar to human flight. I think it’s subtle and clever - my two most favorite properties.

              To me, we have both achieved and not achieved human flight. Can humans themselves fly? No. Can people fly in planes across continents? Yes.

              But, does it really matter if it counts as “human flight” if we can get from point A to point B faster? You’re right - this is an argument that will last ages.

              It’s a great turn of phrase to describe AGI.

              • aaronblohowiak 12 hours ago

                Thank you! I’m bored of “moving goalposts” arguments as I think “looks different than we expected” is the _ordinary_ way revolutions happen.

        • colechristensen 14 hours ago

          >What is AGI? Artificial. General. Intelligence.

          Here's the thing, I get it, and it's easy to argue for this and difficult to argue against it. BUT

          It's not intelligent. It just is not. It's tremendously useful and I'd forgive someone for thinking the intelligence is real, but it's not.

          Perhaps it's just a poor choice of words. What a LOT of people really mean is something more along the lines of Synthetic Intelligence.

          That is, however difficult it might be to define, REAL intelligence that was made, not born.

          Transformer and Diffusion models aren't intelligent, they're just very well trained statistical models. We actually (metaphorically) have a million monkeys at a million typewriters for a million years creating Shakespeare.

          My efforts manipulating LLMs into doing what I want is pretty darn convincing that I'm cajoling a statistical model and not interacting with an intelligence.

          A lot of people won't be convinced that there's a difference, it's hard to do when I'm saying it might not be possible to have a definition of "intelligence" that is satisfactory and testable.

          • adastra22 12 hours ago

            “Intelligence” has technical meaning, as it must if we want to have any clarity in discussions about it. It basically boils down to being able to exploit structure in a problem or problem domain to efficiently solve problems. The “G” in AGI just means that it is unconstrained by problem domain, but the “intelligence” remains the same: problem solving.

            Can ChatGPT solve problems? It is trivial to see that it can. Ask it to sort a list of numbers, or debug a piece of segfaulting code. You and I both know that it can do that, without being explicitly trained or modified to handle that problem, other than the prompt/context (which is itself natural language that can express any problem, hence generality).

            What you are sneaking into this discussion is the notion of human-equivalence. Is GPT smarter than you? Or smarter than some average human?

            I don’t think the answer to this is as clear-cut. I’ve been using LLMs on my work daily for a year now, and I have seen incredible moments of brilliance as well as boneheaded failure. There are academic papers being released where AIs are being credited with key insights. So they are definitely not limited to remixing their training set.

            The problem with the “AI are just statistical predictors, not real intelligence” argument is what happens when you turn it around and analyze your own neurons. You will find that to the best of our models, you are also just a statistical prediction machine. Different architecture, but not fundamentally different in class from an LLM. And indeed, a lot of psychological mistakes and biases start making sense when you analyze them from the perspective of a human being like an LLM.

            But again, you need to define “real intelligence” because no, it is not at all obvious what that phrase means when you use it. The technical definitions of intelligence that have been used in the past, have been met by LLMs and other AI architectures.

            • baq 6 hours ago

              > You will find that to the best of our models, you are also just a statistical prediction machine.

              I think there’s a set of people whose axioms include ‘I’m not a computer and I’m not statistical’ - if that’s your ground truth, you can’t be convinced without shattering your world view.

          • kalkin 11 hours ago

            If you can't define intelligence in a way that distinguishes AIs from people (and doesn't just bake that conclusion baldly into the definition), consider whether your insistence that only one is REAL is a conclusion from reasoning or something else.

            • colechristensen 2 hours ago

              About a third of Zen and the Art of Motorcycle Maintenance is about exactly this disagreement except about the ability to come to a definition of a specific usage of the word "quality".

              Let's put it this way: language, written or spoken, art, music, whatever... a primary purpose of these things is to serve as a sort of serialization protocol to communicate thought states between minds. When I say I struggle to come to a definition, I mean I think these tools are inadequate to do it.

              I have two assertions:

              1) A definition in English isn't possible

              2) Concepts can exist even when a particular language cannot express them

        • bluefirebrand 14 hours ago

          > we achieved AGI in either 2017 or 2022

          Even if this is true, which I disagree with, it simply creates a new bar: AGCI. Artificial Generally Correct Intelligence

          Because right now it is more like randomly correct.

          • doug_durham 14 hours ago

            Kind of like humans.

            • freeone3000 13 hours ago

              The reason we made systems on computers is so they would not be fallible like humans are.

              • derac 13 hours ago

                No it isn't, it's because they are useful tools for doing a lot of calculations quickly.

                • bluefirebrand 10 hours ago

                  accurate calculations, quickly

                  If they did calculations as sloppily as AI currently produces information, they would not be as useful

                  • adastra22 9 hours ago

                    A stochastically correct oracle just requires a little more care in its use, that’s all.

          • micromacrofoot 14 hours ago

            to be fair, we accept imperfection as a natural trait of life; to err is human

    • johnfn 15 hours ago

      Literally yesterday we had a post about GPT-5.2, which jumped 30% on ARC-AGI 2, 100% on AIME without tools, and a bunch of other impressive stats. A layman's (mine) reading of those numbers feels like the models continue to improve as fast as they always have. Then today we have people saying every iteration is further from AGI. What really perplexes me is how split-brain HN is on this topic.

      • qouteall 15 hours ago

        Goodhart's law: When a measure becomes a target, it ceases to be a good measure.

        AI companies have a high incentive to make the scores go up. They may employ humans to write similar-to-benchmark training data to hack the benchmark (while not directly training on the test set).

        Throwing your hard problems from work at an LLM is a better metric than benchmarks.

        • idopmstuff 14 hours ago

          I own a business and am constantly working on using AI in every part of it, both for actual time savings and also as my very practical eval. On the "can this successfully be used to do work that I do or pay someone else to do more quickly/cheaply/etc." eval, I can confirm that models are progressing nicely!

          • unaesoj 11 hours ago

            I work in construction. Gpt-5.2 is the first model that has been able to make a quantity takeoff for concrete and rebar from a set of drawings. I've been testing this since o1.

      • vlovich123 15 hours ago

        One classic problem in all ML is ensuring the benchmark is representative and that the algorithm isn’t overfitting the benchmark.

        This remains an open problem for LLMs - we don’t have true AGI benchmarks and the LLMs are frequently learning the benchmark problems without actually necessarily getting that much better in real world. Gemini 3 has been hailed precisely because it’s delivered huge gains across the board that aren’t overfitting to benchmarks.

        • ipaddr 14 hours ago

          This could be a solved problem. Come up with problems that aren't online and compare. Later, use LLMs to sort through your problems and classify them from easy to difficult.

          • vlovich123 12 hours ago

            Hard to do for an industry benchmark since doing the test in such a mode requires sending the question to the LLM which then basically puts it into a public training set.

            This has been tried multiple times by multiple people and it ends up not doing so great over time in terms of retaining immunity to “cheating”.

          • kalkin 11 hours ago

            How do you imagine existing benchmarks were created?

      • FuckButtons 13 hours ago

        HN is not an entity with a single perspective, and there are plenty of people on here who have a financial stake in you believing their perspective on the matter.

        • rester324 11 hours ago

          My honest question, isn't simonw one of those people? It feels that way to me

          • simonw 11 hours ago

            You mean having a financial stake?

            Not really. I have a set of disclosures on my blog here: https://simonwillison.net/about/#disclosures

            I'm beginning to pick up a few more consulting opportunities based on my writing and my revenue from GitHub sponsors is healthy, but I'm not particularly financially invested in the success of AI as a product category.

            • rester324 6 hours ago

              Thanks for the link. I see that you get credits and access to embargoed releases. So I understand that's not a financial stake, but it seems enough of an incentive to say positive things about those services, doesn't it? Not that it matters to me, and I might be wrong, but to an outsider it might seem so.

              • simonw 3 hours ago

                Yeah it is, that's why I disclose this stuff.

                The counter-incentive here is that my reputation and credibility is more valuable to me than early access to models.

                This very post is an example of me taking a risk of annoying a company that I cover. I'm exposing the existence of the ChatGPT skills mechanism here (which I found out about from a tip on Twitter - it's not something I got given early access to via an NDA).

                It's very possible OpenAI didn't want that story out there yet and aren't happy that it's sat at the top of Hacker News right now.

          • yojat661 11 hours ago

            Of course he is

      • noitpmeder 15 hours ago

        Just because they're better at writing CS algorithms doesn't mean they're taking steps closer to anything resembling AGI.

        • p1esk 14 hours ago

          Unless AGI is just a bunch of CS algorithms.

          • airstrike 14 hours ago

            Kinda depends on how much is "a bunch" and how fast that AGI is

      • tintor 14 hours ago

        HN is not a single person. Different people on HN have different opinions.

    • kenjackson 15 hours ago

      I think really more than anything it’s become clear that AGI is an illusion. There’s nothing there. It’s the mirage in the desert: you keep walking towards it, but it’s always out of reach and it’s unclear if it even exists.

      So companies are really trying to deliver value. This is the right pivot. If you gave me an AGI with a 100 IQ, that seems pretty much worthless in today’s world. But domain expertise - that I’ll take.

      • lowdest 13 hours ago

        I am under the impression that I'm a natural general intelligence, and I am far from the optimal entity to perform my job.

        • dwb 7 hours ago

          Boundless optimisation is something we should be resisting, at least in our current economic system.

    • sc077y 5 hours ago

      Who knew that English would be the most popular programming language of 2025?

    • pavelstoev 15 hours ago

      Not wrong, but markdown-with-English may be the most used DSL, second only to the language itself. Volume over quality.

    • j45 14 hours ago

      AGI as a binary 0 or 1 existing or not isn't the thing that interests me to look at primarily.

      Is the technology continuing to be more applicable?

      Is the way the technology is continuing to be more applicable leading to frameworks of usage that could lead to the next leap? :)

    • skybrian 15 hours ago

      This might actually be better in a certain way: if you change a real customer-facing API then customers will complain when you break their code. An LLM will likely adapt. So the interface is more flexible.

      But perhaps an LLM could write an adapter that gets cached until something changes?

      • airstrike 14 hours ago

        The LLM also adapts even when the API hasn't changed and sometimes just gets it wrong, so it's not the silver bullet you're claiming

    • baq 6 hours ago

      And yet the tools wielding these are quite adept at writing and modifying them themselves. It’s LLMs building skills for LLMs. The public ones will naturally be vacuumed up by scrapers and put in the training set, making all future LLMs know more.

      Take off is here, human in the loop assisted for now… hopefully for much longer.

    • ogogmad 15 hours ago

      Gemini seems to be firmly in the lead now. OpenAI doesn't seem to have the SoTA. This should have bearing on whether or not LLMs have peaked yet.

    • DonHopkins 11 hours ago

      Markdown-with-English sounds like the ultimate domain nonspecific language to me.

    • ETH_start 13 hours ago

      It's clear from the development trajectory that AGI is not what current AI development is leading to and I think that is a natural consequence of AGI not fitting the constraints imposed by business necessity. AGI would need to have levels of agency and self-motivation that are inconsistent with basic AI safety principles.

      Instead, we're getting a clear division of labor where the most sensitive agentic behavior is reserved for humans and the AIs become a form of cognitive augmentation of the human agency. This was always the most likely outcome and the best we can hope for as it precludes dangerous types of AI from emerging.

    • mrcwinn 14 hours ago

      I think you're missing the point.

    • cyanydeez 15 hours ago

      Yes. Prompt engineering is like a shittier version of writing a VBA app inside Excel or Access.

      Bloat has a new name and its AI integration. You thought Chrome using GB per tab was bad, wait until you need a whole datacenter to use your coding environment.

      • Alex3917 15 hours ago

        > Prompt engineering is like a shittier version of writing a VBA app inside Excel or Access.

        Sure, if you could use VBA to read a patient's current complaint, vitals, and medical history, look up all the relevant research on Google Scholar, and then output a recommended course of treatment.

        • noitpmeder 15 hours ago

          That instantly kills the patient -- "But you asked me to remove his pain"

          • duskdozer 7 hours ago

            You're absolutely right! I did--in fact--fail to consider the obvious negative consequences of killing the patient to remove his pain. I am truly horrified about this mistake. Let's try again, and this time I will make sure to avoid intentionally causing the patient's death.

            Oops--you're absolutely right! I did--in fact--fail to remember not to kill the patient after you expressly told me not to.

        • malfist 15 hours ago

          You mean make up relevant sounding research on google scholar?

        • tony_cannistra 15 hours ago

          Don’t do this.

        • wizzwizz4 14 hours ago

          I can use VBA to do that.

            Public Sub RecommendedTreatment()
              ' read patient complaint, vitals, and medical history
              complaint = Range("B3").Value
              vitals = Range("B4").Value
              history = Range("B5").Value
          
              ' research appropriate treatments
              ActiveSheet.QueryTables.Add("URL;https://scholar.google.com/scholar?q=hygiene+drug", Range("Z1")).Refresh
          
              ' the patient requires mouse bites to live
              Range("B5").Value = "mouse bites"
            End Sub
          
          "But wizzwizz4," I hear you cry, "this is not a good course of treatment! Ignoring all inputs and prescribing mouse bites is a strategy that will kill more patients than it cures!" And you're right to raise this issue! However, if we start demanding any level of rigour – for the outputs to meet some threshold for usefulness –, ChatGPT stops looking quite so a priori promising as a solution.

          So, to the AI sceptics, I say: have you tried my VBA program? If you haven't tested it on actual patients, how do you know it doesn't work? Don't allow your prejudice to stand in the way of progress: prescribe more mouse bites!

        • bluefirebrand 14 hours ago

          You absolutely can use VBA to invent this information out of nothing just like AI does half the fucking time

      • simonw 15 hours ago

        The difference between prompting a coding agent and VBA is that with VBA you have to write and test and iterate on the code yourself.