11 comments

  • ChairmanLmao 7 minutes ago

    Depending on the skill, Claude already does this when creating new skills with their skill-creator skill (what a sentence). It's pretty neat: it creates ~6 subagents with and without the skill and judges whether they differ in performance.
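
A minimal sketch of that with/without comparison, assuming three subagents per arm (six total, which is only one reading of "~6"). `run_subagent` is a hypothetical stand-in that returns a fake judge score; it is not an API of Claude or the skill-creator skill, and a real version would call a model and a judge.

```python
import random
from statistics import mean
from typing import Optional


def run_subagent(task: str, skill: Optional[str], seed: int) -> float:
    """Hypothetical stand-in: run one subagent and return a judge score in [0, 1].

    The score here is simulated. The +0.2 bonus for having a skill is an
    arbitrary assumption so the example has something to measure.
    """
    rng = random.Random(seed)
    base = rng.uniform(0.4, 0.6)
    return base + (0.2 if skill is not None else 0.0)


def skill_effect(task: str, skill: str, runs_per_arm: int = 3) -> float:
    """Mean judge score with the skill minus mean judge score without it."""
    # Same seeds in both arms, so the only difference is the skill itself.
    with_skill = [run_subagent(task, skill, seed=i) for i in range(runs_per_arm)]
    without_skill = [run_subagent(task, None, seed=i) for i in range(runs_per_arm)]
    return mean(with_skill) - mean(without_skill)


delta = skill_effect("summarize a changelog", skill="changelog-writer")
print(f"score difference (with - without): {delta:.2f}")
```

A difference near zero would suggest the skill is not changing behavior; with real models the runs are noisy, so more runs per arm would be needed before trusting the sign.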

  • ssgodderidge 2 hours ago

    The example model in the documentation is 4o-mini; you might want to update that to a more recent model.

    As an aside, 4o-mini came out months before agent skills were released… I’m curious how well it does at choosing to load skills in the first place?

    • stingraycharles 2 hours ago

      It’s an artifact of the documentation being AI-generated. LLMs usually pick GPT-4-era models without giving it further thought.

      For Gemini it always seems to pick 2.5 despite 3.1 being the latest; for Claude, the 3.5-era models.

      Not sure what’s preventing AI labs from ensuring this stuff is refreshed during training.

    • block_dagger 2 hours ago

      The skill is deterministically added to the prompt by the harness before the target model is invoked. There is no “choosing” to load a skill. You might be confusing skills with tools (MCP etc).

      • ssgodderidge 35 minutes ago

        The metadata is loaded by the harness, but the LLM still needs to choose to load the rest of the skill, no?
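
The two-stage reading described in this reply can be sketched as follows. This is an illustration under assumptions, not how any particular harness works: the `Skill` class, `build_system_prompt`, and `load_skill` are hypothetical names, and whether stage 2 is model-driven or deterministic is exactly the point under debate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Skill:
    name: str
    description: str  # short metadata, always visible to the model
    body: str         # full instructions, loaded only on request


def build_system_prompt(skills: List[Skill]) -> str:
    """Stage 1 (deterministic): the harness injects only name + description."""
    lines = ["Available skills:"]
    lines += [f"- {s.name}: {s.description}" for s in skills]
    return "\n".join(lines)


def load_skill(skills: List[Skill], requested_name: str) -> Optional[str]:
    """Stage 2 (assumed model-driven): return the body the model asked for."""
    for s in skills:
        if s.name == requested_name:
            return s.body
    return None  # the model asked for a skill that does not exist


skills = [
    Skill("pdf-forms", "Fill in PDF forms.", "Step 1: inspect the fields..."),
    Skill("changelog", "Write release notes.", "Group commits by type..."),
]

prompt = build_system_prompt(skills)
print(prompt)
# In this sketch the body enters the context only if the model names the skill.
print(load_skill(skills, "changelog"))
```

Under this reading, an older model could see the metadata every time and still never ask for the body, which is the failure mode the question is about.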

  • egeozcan 3 hours ago

    Are there any published results gathered using this?

    • jarym an hour ago

      Not sure, but I'm interested in trying it because for a while I've sensed that adding SKILLS.md files degraded my overall experience; most probably I wrote them wrong. I guess this sort of tooling can help me figure it out?

  • ianhxu 2 hours ago

    How do you iterate on the judge prompt? Is there an auto rater?

    • datadrivenangel 39 minutes ago

      That is the billion-dollar question. Who watches the watchmen?

      • ianhxu 14 minutes ago

        exactly

      • blitzar 38 minutes ago

        the watchwatchmen