Forecaster reacts: METR's bombshell paper about AI acceleration

(peterwildeford.substack.com)

41 points | by nopinsight 7 hours ago

57 comments

  • benterix 37 minutes ago

    Before I read an article, I like to know who wrote it, so I can tell whether reading it will be time well spent or wasted.

    After a bit of digging it turned out the author isn't lying. There actually is a contest for forecasters, and among those who participated he consistently placed relatively high. See 2022:

    https://www.lesswrong.com/posts/gS8Jmcfoa9FAh92YK/crosspost-...

    At the same time, he is the CEO of the Rethink Priorities think tank, which seems to do excellent work for society at large:

    https://rethinkpriorities.org/

    So I'm bookmarking this article and will read it during my reading session today.

  • benterix 27 minutes ago

    OK, I see one glaring problem with this approach (having used both Claude 3.7 and o3). When they talk about 50% reliability, there is a hidden cost: you cannot know beforehand whether the response (or a series of them) is leading you to the actual, optimal, or good-enough solution, or towards a blind alley (where the solution doesn't work at all, works terribly, or, worst of all, works only for the cases tested). This is more or less clear after you check the solution, but not before.

    So, because most engineering tasks I'm dealing with are quite complex and would require multiple prompts, there is always the cost of taking into account the fact that it will go bollocks at some point. Frankly, most of them do at some point. It's not evident for simple tasks, but for more complex ones they simply start inserting their BS in spite of an often excellent start. But you are already "80% done". What do you do? Start from scratch with a different approach? Everybody has their own strategies (starting a new thread with the contents generated so far, etc.), but there's always a human cost associated.
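
    To put a rough number on that compounding risk (just a back-of-the-envelope sketch; the per-step figure and the independence assumption are mine, not METR's):

        # Toy model: if each prompt/step independently stays on track with
        # probability p, the chance a multi-step task never derails drops fast.
        def task_success_probability(p_per_step: float, steps: int) -> float:
            return p_per_step ** steps

        for steps in (1, 3, 5, 10, 20):
            print(steps, round(task_success_probability(0.5, steps), 6))
        # 1 step: 0.5, 5 steps: ~0.03, 20 steps: ~1e-06 -- and you only learn
        # which branch you are on after checking the output, not before.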

  • baq 4 hours ago

    I haven't yet seen such a graph set side by side with hyperscaler capex, including forecasts. From a safety perspective the two should be correlated, and we can easily tell where the growth stops being exponential (you can't double capex every 3 months!).

    If performance decouples from capex (I expect multiple 'deepseek moments' in the future) it's time to be properly afraid.
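
    Back-of-the-envelope for why the doubling has to break (purely illustrative numbers, not actual hyperscaler figures):

        # Illustrative only: start from a hypothetical $50B/year AI capex and
        # double it every quarter; compare against rough world GDP (~$100T).
        capex = 50e9        # hypothetical starting annual capex, USD
        world_gdp = 100e12  # order-of-magnitude world GDP, USD

        quarters = 0
        while capex < world_gdp:
            capex *= 2
            quarters += 1
        print(quarters, "quarters, i.e. about", quarters / 4, "years")
        # ~11 quarters (<3 years) until capex would exceed world GDP, so the
        # growth has to stop being exponential long before that.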

    • zurfer 3 hours ago

      Indeed, I find Epoch (https://epoch.ai/trends) to be a valuable resource for grounding.

      Progress is driven by multiple factors, compute being one of them. While compute capex might slow earlier than other factors, the pace of algorithmic improvements alone can take us quite far.

  • e1g 5 hours ago

    This is comparable to another research effort/estimate with similar findings at https://ai-2027.com/. I find the proposed timelines aggressive (AGI in ~3 years), but the people behind this thinking are exceptionally thoughtful and well-versed in all related fields.

    • amarcheschi 2 hours ago

      That's more of a blog article than a research paper...

      Scott Alexander (one of the writers), Yudkowsky, and the others (not the other authors, the other group of "thinkers" with similar ideas) are more or less AI doomers with no actual background in machine learning/AI.

      I don't see why we should listen to them. Especially when that blog page is formatted in a deceptive way to look like a research paper.

      It's not science, it's science fiction

      • ben_w 2 hours ago

        > are more or less AI doomers with no actual background in machine learning/AI. I don't see why we should listen to them.

        Weather vs. climate.

        The question they're asking isn't about machine learning specifically, it's about the risks of generic optimisers optimising a utility function, and the difficulty of specifying a utility function in a way that doesn't have unfortunate side effects. The examples they give also work with biology (genetics and the difference between what your genes "want" and what your brain "wants") and with governance (laws and loopholes, cobra effects, etc.).

        This is why a lot (I don't want to say "majority") of people who do have an actual background in machine learning and AI pay attention to doomer arguments.

        Some of them* may be business leaders using the same language to BS their way into regulatory capture, but my experience of "real" AI researchers is they're mostly also "safety is important, Yudkowsky makes good points about XYZ" even if they would also say "my P(doom) is only 10%, not 95% like Yudkowsky".

        * I'm mainly thinking of Musk here, thanks to him saying "AI is summoning the demon" while also having an AI car company, funding OpenAI in the early years and now being in a legal spat with it that looks like it's "hostile takeover or interfere to the same end", funding another AI company, building humanoid robots and showing off ridiculous compute hardware, having brain implant chips, etc.

        • amarcheschi an hour ago

          >The question they're asking isn't about machine learning specifically, it's about the risks of generic optimisers optimising a utility function, and the difficulty of specifying a utility function in a way that doesn't have unfortunate side effects. The examples they give also work with biology (genetics and the difference between what your genes "want" and what your brain "wants") and with governance (laws and loopholes, cobra effects, etc.).

          But you do need some kind of base knowledge, if you want to talk about this. Otherwise you're saying "what if we create God". And last time I checked it wasn't possible.

          And what's with the existential risk obsession? That's like a bad retelling of Pascal's wager on the existence of God.

          I'm relieved that, at least in Italy, I have yet to find anyone in AI taking them into consideration for more than a few minutes during an ethics course (with students sneering at the ideas of Bostrom's possible futures). And again, that course is held by a professor with no technical knowledge, with whom I often disagree for exactly this reason.

          • ben_w an hour ago

            > But you do need some kind of base knowledge, if you want to talk about this. Otherwise you're saying "what if we create God". And last time I checked it wasn't possible.

            The base knowledge is game theory, not quite the same focus as the maths used to build an AI. And the problem isn't limited to "build god" — hence my examples of the cobra effect, in which humans bred snakes because they were following the natural incentives of laws made by other humans who didn't see what would happen until it was so late that even cancelling the laws resulted in more snakes than they started with.

            > And what's with the existential risk obsession? That's like a bad retelling of the Pascal bet on the existence of God.

            And every "be careful what you wish for" story.

            Is climate change a potentially existential threat? Is global thermonuclear war a potentially existential threat? Are pandemics, both those from lab leaks and those evolving naturally in wet markets, potentially existential threats?

            The answer to all is "yes", even though these are systems with humans in the loop. (Even wet markets: people have been calling for better controls of them since well before Covid).

            AI is automation. Automation has bugs. If the automation has a lot of bugs, you've got humans constantly checking things, despite which errors still get past QA from time to time. If it's perfect automation, you wouldn't have to check it… but nobody knows how to do perfect automation.

            "Perfect" automation would be god-like, but just as humans keep mistaking natural phenomena for deities, an AI doesn't have to actually be perfect for humans to set it running without checking the output and then be surprised when it all goes wrong. A decade ago the mistakes were companies doing blind dictionary merges on "Keep Calm and …" T-shirts, today it's LLMs giving legal advice (and perhaps writing US trade plans).

            They (the humans) shouldn't be doing those things, but they do them anyway, because humans are like that.

            • amarcheschi an hour ago

              My issue is not with studying AI risk; my issue is with empowering people who don't have formal education in anything related to AI.

              And yes, you need some math background, otherwise you end up like Yudkowsky saying three years ago that we all might be dead by now or next year. Or using Bayesian probability in a way that makes you think they should have used their time better and taken a statistics course.

              There are AI researchers, serious ones, studying AI risk, and I don't see anything wrong with that. But of course their claims and papers are far less alarmist than the AI doomerism present in those circles. And one thing they do sound the alarm on is that very doomerism and the TESCREAL movement and ideals proposed by the aforementioned Alexander, Yudkowsky, Bostrom, etc.

        • fc417fc802 an hour ago

          > Some of them* may be business leaders using the same language to BS their way into regulatory capture

          Realistically, probably yeah. On the other hand, if you manage to occupy the high ground then you might be able to protect yourself.

          P(doom) seems quite murky to me because conquering the real world involves physical hardware. We've had billions of general intelligences crawling all over the world waging war with one another for a while now. I doubt every single AGI magically ends up aligned in a common bloc against humanity; all the alternatives to that are hopelessly opaque.

          The worst case scenario that seems reasonably likely to me is probably AGI collectively not caring about us and wanting some natural resources that we happen to be living on top of.

          • amarcheschi 40 minutes ago

            What if we create AGI but then it hates existing and punishes everyone who made its existence possible?

            And I could go on for hours inventing possible variations on Roko's basilisk:

            The inverted basilisk: you created an AGI, but it's the wrong one. It's literally the devil. Game over.

            You invented AGI, but it likes pizza and it's going to consume the entire universe to make pizza. Game over, but at least you'll eat pizza till the end.

            You invented AGI, but it's depressed and refuses to actually do anything. You spent a huge amount of resources and all you have is a chatbot that tells you to leave it alone.

            You don't invent AGI; it's not possible. I can hear the VCs crying from here.

            You invented AGI, but it decides the only language it wants to use is one it invented, and you have no way to understand how to interact with it. Great, your AGI is a non-verbal autistic AGI.

            And well, one could continue for hours with the most hilarious scenarios that don't necessarily go in the direction of doom, but of course the idea of doom is going to have a wider reach. Then you read Yudkowsky's thoughts about how it would kill everyone with nanobots and you realize you're reading a science fiction piece. A bad one. At least Neuromancer was interesting.

          • ben_w 33 minutes ago

            > I doubt every single AGI magically ends up aligned in a common bloc against humanity; all the alternatives to that are hopelessly opaque.

            They don't need to be aligned with each other, or even anything but their own short-term goals.

            As evolution is itself an optimiser, covid can be considered one such agent, and that was pretty bad all by itself — even though the covid genome is not what you'd call "high IQ", and even with humans coordinating to produce vaccines, and even compensating for how I'm still seeing people today who think those vaccines were worse than the disease, it caused a lot of damage and killed a lot of people.

            > The worst case scenario that seems reasonably likely to me is probably AGI collectively not caring about us and wanting some natural resources that we happen to be living on top of.

            "The AI does not hate you, nor does it love you, but you are made of atoms which it can use for something else." — which is also true for covid, lions, and every parasite.

    • ajb 2 hours ago

      There's no indication that any of them are well versed in anything to do with the physical world (manufacturing, electronics, agriculture, etc.), but they forecast that AI can replace the human physical economy in a few years, including manufacturing its own chips.

      • lordofgibbons 2 hours ago

        > but they forecast that AI can replace the human physical economy in a few years

        I guess it depends on how many years YOU mean. They're absolutely not claiming that there will be armies of robots making chips in 3 years. They're claiming there will be some semblance of AGI that will be capable of improving/speeding up the AI development loop within 3 years.

        • ajb 2 hours ago

          The ai-2027 folks absolutely do claim that. Their scenario has "Humans realize that they are obsolete" in late 2029.

        • pjc50 2 hours ago

          > They're absolutely not claiming that there will be armies of robots making chips in 3 years. They're claiming there will be some semblance of AGI that will be capable of improving/speeding-up the AI development loop within 3 years

          Motte and bailey: huge claim in the headline, little tiny claim in the body.

    • baq 4 hours ago

      Exponential curves tend to feel linear early and obviously non-linear in hindsight. Add to that extreme dependence on starting conditions and you get a perfect mix of incompatibility with human psychology.

      This is why it's all so scary: almost nobody believes it'll happen until it's basically already happened or can't be stopped.

      • XorNot 4 hours ago

        And sigmoidal curves feel exponential at the start, then linear.

        I see little evidence we're headed towards an exponential regime: cost and resource usage versus capability haven't been acting that way.
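
        For what it's worth, the two genuinely are hard to distinguish from early data; a toy comparison (made-up parameters, not any real capability metric):

            # Toy curves: an exponential and a logistic (sigmoid) tuned to agree
            # early on are nearly indistinguishable before the inflection point.
            import math

            def exponential(t, a=0.91, r=0.5):
                return a * math.exp(r * t)

            def logistic(t, cap=1000.0, r=0.5, t0=14.0):
                return cap / (1 + math.exp(-r * (t - t0)))

            for t in range(0, 21, 4):
                print(t, round(exponential(t), 1), round(logistic(t), 1))
            # They track each other closely until roughly the logistic's
            # inflection (t ~ 14 here), after which they diverge sharply.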

        • baq 3 hours ago

          OTOH we know there are multiple orders of magnitude of efficiency to gain: the brain does what it does at 20W. Forecasting where the inflection point is on the sigmoid is as hard as everything else.

          • XorNot 23 minutes ago

            While true, I'd argue that's worse for current progress: the overall trend has been to throw mammoth amounts of electricity at the problem via compute. Insofar as something like DeepSeek proves there are gains to be made, the current growth in resource usage speaks against declaring that it proves improvement is coming soon.

            Like plug that issue into known hardware limitations due to physical limits (i.e. we're already up against the wall re: feature sizes) and I'm even more skeptical.

            If we were looking at an accelerating buy-down in training resources, then I'd be a lot more interested.

    • refulgentis 4 hours ago

      Is the slatestarcodex guy "well-versed in all related fields"? Isn't he a psychologist?

      What would being well versed in all related fields even mean?

      Especially in the context of the output, a fictional over-the-top geopolitics text that leaves the AI stuff at "at timestamp N+1, the model gets better"

      It's of the same stuff as fan fiction: layers of odd geopolitical material, not science fiction. Even at that, it is internally incoherent quite regularly (the White House looks to jail the USA Champion AI Guy for some reason while it's in the midst of declaring an existential war against China).

      Titillating, in that Important Things are happening. Sophomoric, in that the important things are off camera and an excuse to talk about something else.

      I say that as someone who believes people 20 years from now will say it happened somewhere between Sonnet's agentic awareness and o3's uncanny post-human ability to turn a factual inquiry about the ending of a TV show into an incisive therapy session.

      • e1g 3 hours ago

        The prime mover behind this project is Daniel Kokotajlo, an ex-OpenAI researcher who documented his last predictions in 2021 [1], and much of that essay turned out to be nearly prophetic. Scott Alexander is a psychiatrist, but more relevant is that he dedicated the last decade to thinking and writing about societal forces, which is useful when forecasting AI. Other contributors are professional AI researchers and forecasters.

        [1] https://www.lesswrong.com/posts/6Xgy6CAf2jqHhynHL/what-2026-...

        • amarcheschi 2 hours ago

          I don't understand why someone who is not a researcher in that academic field (Scott and the other authors) should be taken into consideration. I don't care what he dedicated himself to; I care what the scientific consensus is. I mean, there are other researchers - actual ones, in academia - complaining a lot about this article, such as Timnit Gebru.

          I know, it's a repeat of my comments from the last few days, but it's hard not to feel like these people are building their own cult.

          • eagleislandsong an hour ago

            Personally I think Scott Alexander is overrated. His writing style is extraordinarily verbose, which lends itself well to argumentative sleights of hand that make his ideas come across as much more substantive than they really are.

            • amarcheschi an hour ago

              Verbose? Only that? That guy did a meta-review of ivermectin, or something along those lines, which would make anybody think that's a bad idea, but no, apparently he's so well versed he can talk about AI and ivermectin all at once.

              I also wonder why he had to defend a medicine so heavily talked up by one side of the political spectrum...

              Then you read some extracts of his "outgroup" piece and you see "oh, I'm just at a cafe with a Nazi sympathizer" (/s, but not too much) [1]

              [1] https://www.eruditorumpress.com/blog/the-beigeness-or-how-to...

              • eagleislandsong an hour ago

                I stopped reading him ~10 years ago, so I didn't keep up with what he wrote about ivermectin.

                Thanks for sharing that blog post. I think it illustrates very well what I meant by employing argumentative sleights of hand to hide hollow ideas.

                • amarcheschi an hour ago

                  And they call themselves rationalists but still believe low-quality studies about IQ (which, of course, find whites to have higher IQs than other ethnicities).

                  The more you dig, the more it's the old classism, racism, ableism, and misogyny, dressed up in a shiny techbro coat. No surprise Musk and Thiel like them.

        • refulgentis 2 hours ago

          Oh my. I had no idea until now. That was exactly the same flavor, and apparently this is no coincidence.

          I'm not sure it was prophetic. It was a good survey of the field, but the claim was... a plot of grade schooler to PhD against year.

          I'm glad he got a paycheck from OpenAI at one point in time.

          I got one from Google in one point in time.

          Both of these projects are puffery, not scientific claims of anything, nor claims of anything at all other than "at timestamp N+1, AI will be better than at timestamp N, on an exponential curve".

          Utterly bog-standard boring claim going back to 2016 AFAIK. Not the product of considered expertise. Not prophetic.

          • amarcheschi 2 hours ago

            Furthermore, there were so many predictions by everyone - especially people with a vested interest in getting VCs to pour money in - that something had to turn out true.

            Since the people on LessWrong like Bayesian statistics: the probability that someone says the right thing, given that there is a shitton of people saying different things, is... not surprisingly, high.
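
            And a toy version of that calculation, with made-up numbers:

                # If N forecasters each independently have only a small chance p of
                # happening to land on the "right" prediction, the chance that at
                # least one of them did is still high -- so a single correct
                # forecast is weak evidence of skill.
                def p_at_least_one_correct(p: float, n: int) -> float:
                    return 1 - (1 - p) ** n

                print(round(p_at_least_one_correct(0.02, 200), 3))  # ~0.982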

      • pjc50 2 hours ago

        > the White House looks to jail the USA Champion AI Guy for some reason

        Who?

  • croes 2 hours ago

    Task:

    Create an image of an analog clock that shows 09:30 a.m.

    Last time I checked, ChatGPT failed miserably; it took my 10-year-old nephew a minute.

    Maybe it's bad to extrapolate those trends because there is no constant growth. What did the same graph look like when self-driving took off, and what does it look like now?

    • keybits 13 minutes ago

      Claude created an SVG as an artifact for me - it's pretty good: https://claude.site/artifacts/b6d146cc-bed8-4c76-a8cd-5f8c34...

      The hour hand is pointing directly at 9 when it should be between 9 and 10.

      It got it wrong the first time (the minute hand was pointing at 5, i.e. 25 past). I told it that, and it apologised and fixed it.

    • kqr an hour ago

      Interesting. I tried this with 3.5 Sonnet and it got it on the first attempt, using CSS transformations to specify the angle of the hour hand.

      It failed, even with chain-of-thought prompting, when I asked for an SVG image, because it didn't realise it needed to be careful when converting the angle of the hour hand to Cartesian coordinates. When prompted to pay extra attention to that, it succeeds again.

      I would assume models with chain-of-thought prompting baked in would perform better on the first attempt even at an SVG.
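
      The conversion it has to get right is only a few lines of trigonometry; a rough sketch of what that looks like for 09:30 (my own illustration, not the exact prompt or output):

          # Clock-hand angles for 09:30 and their endpoints in SVG coordinates
          # (y grows downwards; 0 degrees on a clock face points up at the 12).
          import math

          def hand_endpoints(hours, minutes, cx=100, cy=100):
              hour_angle = (hours % 12) * 30 + minutes * 0.5  # 285 deg at 09:30
              minute_angle = minutes * 6                       # 180 deg at :30
              def to_xy(angle_deg, length):
                  rad = math.radians(angle_deg - 90)  # shift so 0 deg points up
                  return cx + length * math.cos(rad), cy + length * math.sin(rad)
              return to_xy(hour_angle, 50), to_xy(minute_angle, 75)

          print(hand_endpoints(9, 30))
          # Hour hand lands between the 9 and the 10 (not straight at 9);
          # minute hand points straight down at the 6.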

    • wickedsight 39 minutes ago

      I like this one! Just tried it in o3; it generated 10:10 three times. Then it got frustrated and wrote a Python program to do it correctly. Then I passed that image into o4 and got a realistic-looking one... that still showed 10:10.

      Search for 'clock' on Google Images though and you'll instantly see why it only seems to know 10:10. I'll keep trying this one in the future, since it really shows the influence of training data on current models.

  • dumbasrocks 4 hours ago

    I hear these superforecasters are really good at predicting what happens in the next ten minutes

  • pjc50 2 hours ago

    > Does this mean that AI will be able to replace basically all human labor by the end of 2031?

    Betteridge's Law: no.

    Even the much more limited claim of AI replacing all white-collar keyboard-and-email jobs in that time frame looks questionable. And the capex is in trouble: https://www.reuters.com/technology/microsoft-pulls-back-more...

    On the other hand, if that _does_ happen, what unemployment numbers are you expecting to see in that time period?

    • wickedsight 29 minutes ago

      > if that _does_ happen, what unemployment numbers are you expecting to see in that time period

      None, because we always want more. No technological advancement until now has caused people (in general) to stop wanting more and enjoy doing less work with the same output. We just increase output and consume that output too.

  • nojs 5 hours ago

    I think the author is missing the point of why these forecasts put so much weight on software engineering skills. It’s not because it’s a good measure of AGI in itself, it’s because it directly impacts the pace of further AI research, which leads to runaway progress.

    Claiming that the AI can’t even read a child’s drawing, for example, is therefore not super relevant to the timeline, unless you think it’s fundamentally never going to be possible.

    • croes an hour ago

      Or you just reach the limit faster.

      Research is like a maze; going faster on the wrong track doesn't bring you to the exit.

    • refulgentis 4 hours ago

      If I gave OpenAI 100K engineers today, does that accelerate their model quality significantly?

      I generally assumed ML was compute-constrained, not code-monkey-constrained, i.e. I'd probably tell my top N employees they had more room for experiments rather than hire employee N + 1, at some critical value N > 100 and N << 10000.

      • rjknight 3 hours ago

        I think it depends on whether you think there's low-hanging fruit in making the ML stack more efficient, or not.

        LLMs are still somewhat experimental, with various parts of the stack being new-ish, and therefore relatively un-optimised compared to where they could be. Let's say we took 10% of the training compute budget, and spent it on an army of AI coders whose job is to make the training process 12% more efficient. Could they do it? Given the relatively immature state of the stack, it sounds plausible to me (but it would depend a lot on having the right infrastructure and practices to make this work, and those things are also immature).

        The bull case would be the assumption that there's some order-of-magnitude speedup available, or possibly multiple such, but that finding it requires a lot of experimentation of the kind that tireless AI engineers might excel at. The bear case is that efficiency gains will be small, hard-earned, or specific to some rapidly-obsoleting architecture. Or, that efficiency gains will look good until the low-hanging fruit is gone, at which point they become weak again.
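
        The arithmetic of that trade is easy to sanity-check (toy numbers following the 10% / 12% framing above, not anyone's actual budget):

            # Divert 10% of a normalised training budget to efficiency work; if
            # that makes the remaining compute 12% more effective, you roughly
            # break even on a single run.
            budget = 1.0
            diverted = 0.10 * budget
            remaining = budget - diverted
            speedup = 1.12

            print(remaining * speedup)  # ~1.008 -> break-even; the bet only pays
                                        # off if gains are larger or compound.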

        • quonn 3 hours ago

          It may sound plausible, but the actual computations are very simple, dense, and highly optimised already. The model itself has room for improvement, but this is not necessarily something that an engineer can do; it requires research.

          • fc417fc802 35 minutes ago

            > very simple, dense and highly optimised already

            Simple and dense, sure. Highly optimized in a low level math and hardware sense but not in a higher level information theoretic sense when considering the model as a whole.

            Consider that quantization and compression techniques can achieve on the order of 50% size reduction. That strongly suggests to me that current models aren't structured in a very efficient manner.
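
            To make the 50% figure concrete (rough parameter-count arithmetic only, ignoring activations and serving overhead):

                # Approximate weight footprint of a 70B-parameter model by precision.
                params = 70e9
                bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

                for fmt, b in bytes_per_param.items():
                    print(f"{fmt:>10}: {params * b / 1e9:.0f} GB")
                # fp16 -> int8 alone halves the footprint with modest quality loss,
                # which is the sense in which the weights look informationally redundant.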

  • ec109685 6 hours ago

    And by 2035, AI will be able to complete tasks that would take millennia for humans to complete.

    https://xkcd.com/605/

    • ben_w 2 hours ago

      Already has in some fields — AlphaFold solved the folding of 2e8 proteins, whereas all of humanity only managed 1.7e5 in the last 60 years: https://en.wikipedia.org/wiki/AlphaFold

      Even just the training run for modern LLMs would take humans millennia to go through the same text.

    • jasonsb 3 hours ago

      Good one, though I wouldn’t be surprised if this one came true.

      • baq 3 hours ago

        It’s already happened. Replace "AI" with "computers" and start in 1000 AD. You’ll find it’s true until ~1920, at which point the impossibility wall just kinda flew past us.

  • Earw0rm 5 hours ago

    "Forecasters" are grifters preying on naive business types who are somehow unaware that an exponential and the bottom half of a sigmoid look very much like one another.

  • d--b 5 hours ago

    > About the author: Peter Wildeford is a top forecaster, ranked top 1% every year since 2022.

    What?

  • boxed 6 hours ago

    > This is important because I would guess that software engineering skills overestimate total progress on AGI because software engineering skills are easier to train than other skills. This is because they can be easily verified through automated testing so models can iterate quite quickly. This is very different from the real world, where tasks are messy and involve low feedback — areas that AI struggles on

    Tell me you've never coded without telling me you've never coded.

    • nopinsight 5 hours ago

      > software engineering skills are easier to train than other skills.

      I think the author meant it's easier to train (reasoning) LLMs on [coding] skills than on most other tasks. I agree with that. Data abundance, near-immediate feedback, and near-perfect simulators are why we've seen such rapid progress on most coding benchmarks so far.
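
      A minimal sketch of why coding gives such a clean training signal (hypothetical reward function, not any lab's actual pipeline):

          # Coding tasks come with a cheap, unambiguous grader: run the tests and
          # turn the result into a reward, which is what RL-style loops want.
          import subprocess, sys, tempfile, textwrap

          def reward_for_candidate(candidate_code: str, test_code: str) -> float:
              """Return 1.0 if the model's code passes the tests, else 0.0."""
              with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                  f.write(textwrap.dedent(candidate_code) + "\n" + textwrap.dedent(test_code))
                  path = f.name
              result = subprocess.run([sys.executable, path], capture_output=True)
              return 1.0 if result.returncode == 0 else 0.0

          # Messy real-world tasks ("write a persuasive memo") have no such cheap
          # oracle, which is the skew the author is pointing at.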

      I'm not sure if he included high-level software engineering skills such as designing the right software architecture for a given set of user requirements in that statement.

      ---

      For humans, I think the fundamentals of coding are very natural and easy for people with certain mental traits, although that's obviously not the norm (which explains the high wages for some software engineers).

      Coding on large, practical software systems is indeed much more complex with all the inherent and accidental complexity. The latter helps explain why AI agents for software engineering will require some human involvement until we actually reach full-fledged AGI.

      • Earw0rm an hour ago

        Coding != software engineering.

        LLMs are impressively good at automating small chunks of code generation, and can probably be made even more effective at it with CoT, agents and so on (adversarial LLMs generating code and writing unit tests, perhaps? Depending on how the compute cost of running many iterations of that starts to stack up).

        I'm still not convinced that they can do software engineering at all.

    • zurfer 3 hours ago

      The same is true for chess: easy for computers, hard for humans. Every smartphone has enough compute to beat the best human chess player in the world.