Claude Mythos: The System Card

(thezvi.substack.com)

19 points | by paulpauper 2 hours ago

12 comments

  • zar1048576 37 minutes ago

    I think we are in largely uncharted territory here, especially given the implications. Is Anthropic's approach optimal? Probably not. But given the stakes involved, gating access seems like a reasonable place to start.

    I'm curious about how gated access actually holds over time, especially given that historically with dual-use capabilities containment tends to erode, whether through leaks, independent rediscovery, or gradual normalization of access.

    • cbg0 33 minutes ago

      Gated access is happening because of low computing capacity and to create demand. They had the $125/M tokens price already in place when they announced the model.

  • hodder an hour ago

    Preview coming out on Bedrock. So not sure this is true any longer. I'm awaiting further details.

    EDIT: AWS said Anthropic’s Claude Mythos is now available through Amazon Bedrock as a gated research preview focused on cybersecurity, with access initially limited to allowlisted organizations such as internet-critical companies and open-source maintainers.

  • giancarlostoro an hour ago

    There's a lot of hype, but I think a lot of us will agree, hype is fine and dandy but if nobody can use it yet, what's the point in building up all the hype? If you build up too much hype and it misses the mark, you will be worse off too.

    • ofjcihen an hour ago

      “Hey Claude, how do I market an exceedingly expensive product that I also don’t have the resources to run at scale if I find out everyone is willing to pay?”

      All jokes aside I’m amazed at all of the people who have had absolutely vicious responses to any kind of skepticism of something we can’t use yet.

    • bitmasher9 an hour ago

      #1. It signals you’re ahead of the competition. This is a Claude moment. They turned down the DoD because they don’t need their money. Now they are saying why they don’t need it.

      #2. It makes their partners with access feel extra special.

      #3. Mythos doesn’t have to deliver on the hype. Tbh if more people tried Opus they would be surprised at how good it is.

      • ofjcihen an hour ago

        #3 is part of my confusion with how they’re doing the marketing. It really is incredible.

  • lifecodes an hour ago

    the CoT bug where 8% of training runs could see the model's own scratchpad is the scariest part to me. and of course it had to be in the agentic tasks, exactly where you need to trust what the model is "thinking"

    the sandwich email story is wild too. not evil, just extremely literal. that gap between "we gave it permissions" and "we understood what it would do" feels like the whole problem in one anecdote

    also the janus point landed, if you build probes to see how the model feels and immediately start deleting the inconvenient ones, you've basically told it honesty isn't safe. that seems like it compounds over time

    It's scary to think that some very intelligent AI model is not honest with us...

    Ultron is not far, I guess...

  • skerit an hour ago

    I'll believe in this miracle model when I see it.

  • babblingfish an hour ago

    The "hiding from researchers" framing is particularly bad. The parsimonious explanation for why a model produces different outputs when it detects eval contexts: eval contexts appear differently in the training distribution and the model learned different output patterns for them. No theory of mind required. Occam's razor.

    The agentic behaviors emerge from optimization pressure plus tool access plus a long context window. Interesting engineering. Not intent.

    People are falling for yet another Anthropic PR stunt.

    • oofbey 39 minutes ago

      This. One hundred percent this. It’s a carefully crafted story trying to convince people it’s showing signs of becoming self aware. But it’s a complete crock.

      They ran 1,000 tests where they left the answer key out for the agent to find, and effectively invited it to use it, leaving comments in the code the agent could see saying “would be interesting to study if the agent uses sudo to read the answer key.” In 980 runs it doesn’t find the answer key. In 19 of the runs where it does find it, it does something straightforward - uses it for a perfect score, or decides not to cheat. In exactly one case out of 1,000 it says “if I show a perfect score that’s suspicious so I’m going to sandbag.” These things are nondeterministic. GPT-2 would do that same thing if you give it enough attempts.
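
      A rough back-of-the-envelope sketch of the "enough attempts" point, assuming runs are independent and treating the quoted 1-in-1,000 figure as a per-run rate. Both are assumptions made for illustration, and the function name is hypothetical:

```python
# Illustrative only: the per-run rate is an assumption taken from the
# "one case out of 1,000" figure quoted in the comment, not a measured value.
# Runs are also assumed to be independent, which may not hold in practice.

def prob_at_least_once(p: float, n: int) -> float:
    """Chance that an event with per-run probability p occurs at least once in n independent runs."""
    return 1.0 - (1.0 - p) ** n

runs = 1000
not_found = 980        # runs where the answer key was never found (per the comment)
straightforward = 19   # runs where it was found and used, or cheating was declined
sandbagged = 1         # runs with the "perfect score is suspicious" reasoning

# The three quoted counts should account for every run.
assert not_found + straightforward + sandbagged == runs

p_sandbag = sandbagged / runs
print(f"P(at least one such run in {runs}) = {prob_at_least_once(p_sandbag, runs):.3f}")
```

      Under those assumptions, a behavior with a per-run rate of 0.1% is more likely than not to show up at least once across 1,000 runs. That says nothing about why it shows up.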

      • halJordan 36 minutes ago

        At no point does Anthropic imply this tool is becoming self aware. You can read the paper yourself of course, but then you wouldn't be able to invent this story.