The latest AI scaling graph – and why it hardly makes sense

(garymarcus.substack.com)

27 points | by nsoonhui 5 hours ago

14 comments

  • Sharlin 3 hours ago

    > Unfortunately, literally none of the tweets we saw even considered the possibility that a problematic graph specific to software tasks might not generalize to literally all other aspects of cognition.

    Why am I not surprised?

  • yorwba 4 hours ago

    > you could probably put together one reasonable collection of word counting and question answering tasks with average human time of 30 seconds and another collection with an average human time of 20 minutes where GPT-4 would hit 50% accuracy on each.

    So do this and pick the one where humans do best. I doubt that doing so would show all progress to be illusory.

    But it would certainly be interesting to know what the easiest thing is that a human can do but current AIs struggle with.

    • xg15 3 hours ago

      > But it would certainly be interesting to know what the easiest thing is that a human can do but current AIs struggle with.

      Still "Count the R's" apparently.

  • hatefulmoron 4 hours ago

    I had assumed that the Y axis corresponded to some measurement of the LLM's ability to actually work on/mull over a task in a loop while making progress. In other words, I thought it meant something like "you can leave Sonnet 3.7 alone for a whole hour and it will make meaningful progress on a problem", but the reality is less impressive. Serves me right for not reading the fine print.

  • ReptileMan 4 hours ago

    The demand by a fraction of Bay Area intellectuals for AI disasters and the doom of humanity far outstrips supply. The recent fanfic by Scott Alexander and other similar "thinkers" is also worth checking out for a chuckle: https://ai-2027.com/

    • ben_w 4 hours ago

      AI is software.

      As software gets more reliable, people come to trust it.

      Software still has bugs, and that trust means those bugs still get people killed.

      That was true of things we wouldn't call AI any more, and it's still true of things we do.

      Doesn't need to take over or anything when humans are literally asleep at the wheel because they mistakenly think the AI can drive the car for them.

      Heck, even building codes and health & safety rules are written in blood. Why would AI be the exception?

      • clauderoux 3 hours ago

        As Linus Torvalds said in a recent interview, humans don't need AI to make bugs.

    • okthrowman283 4 hours ago

      To be fair, though, the author of AI 2027 has been prescient in his previous predictions.

    • dist-epoch 4 hours ago

      Turkey fallacy.

      The apocalypse will only happen once. Just like global nuclear war.

      The fact that there hasn't been a global nuclear war so far doesn't mean everyone who fears one is a crazy, irrational person.

      • ReptileMan 4 hours ago

        No. It just means they are stupid in the way only extremely intelligent people could be.

        • Sharlin 3 hours ago

          People being afraid of a nuclear war are stupid in a way only extremely intelligent people can be? Was that just something that sounded witty in your mind?

  • Nivge 4 hours ago

    TL;DR - the benchmark depends on its specific dataset, and it isn't a perfect proxy for AI progress. That doesn't mean it makes no sense, or has no value.

  • dist-epoch 4 hours ago

    > Abject failure on a task that many adults could solve in a minute

    Maybe the author should check, before pressing "Publish", whether the info in the post is already outdated.

    ChatGPT passed the image generation test mentioned: https://chatgpt.com/share/68171e2a-5334-8006-8d6e-dd693f2cec...

    • frotaur 4 hours ago

      Even setting aside the fact that this image is just an illustration and really not the main point of the article, ChatGPT actually failed again in the chat you posted, because the r's are not circled.