Benchmarking leading AI agents against Google reCAPTCHA v2

(research.roundtable.ai)

38 points | by mdahardy 3 hours ago

30 comments

  • jngiam1 a few seconds ago

    I hypothesize that these AI agents all likely exceed human performance now.

  • Xenoamorphous 2 hours ago

    I’m sure they do better than me. Sometimes I get stuck on an endless loop of buses and fire hydrants.

    Also, when they ask you to identify traffic lights, do you select the post? And when it's motorcycles/bicycles, do you select the person riding them?

    • Sayrus an hour ago

      Testing those same captcha on Google Chrome improved my accuracy by at least an order of magnitude.

      Either that or it was never about the buses and fire hydrants.

      • ACCount37 an hour ago

        It's a known "issue" of reCaptcha, and many other systems like it. If it thinks you're a bot, it will "fail" the first few correct solves before it lets you through.

        The worst offenders will just loop you forever, no matter how many solves you get right.

    • hnburnsy an hour ago

      Pro tip: select a section you know is wrong, then deselect it before submitting. Seems to help prove you're not a bot.

    • utopman an hour ago

      Not sure if it's your case, but I think I sometimes had to solve many of them when I was rushing through my daily tasks. My hypothesis is that I solve them faster than the "average human solving duration" reCAPTCHA seems to expect (I think solving it too fast triggers the bot fingerprint). More recently, when I hit a reCAPTCHA, I consciously don't rush it, and I no longer seem to have to solve more than one. I don't think I have superpowers, but as a tech guy I do a lot of computing tasks mechanically.

    • datadrivenangel 2 hours ago

      That's not due to accuracy, you're getting tarpitted for not looking human enough.

    • sixhobbits 2 hours ago

      I haven't looked into this much, but I think the fact that humans are willing to do this for something in the "cents per thousand" range means it's really hard to get much interest in automating it.

    • mdahardy an hour ago

      While running this I looked at hundreds and hundreds of captchas. And I still get rejected on like 20% of them when I do them. I truly don't understand their algorithm lol

    • Semaphor an hour ago

      There's a browser extension to solve them. Buster.

  • cedws 13 minutes ago

    Would performance improve if the tiles were stitched together and fed to a vision model, and the tiles then selected based on a bounding box?
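
    Something like this for the tile-selection step, maybe (the 3x3 grid and pixel values are just for illustration; the box would come from whatever vision model you use):

        def tiles_for_box(box, image_size, grid=3, min_overlap=0.15):
            """Return (row, col) tiles that overlap the detection box enough to click."""
            x1, y1, x2, y2 = box
            w, h = image_size
            tile_w, tile_h = w / grid, h / grid
            selected = []
            for row in range(grid):
                for col in range(grid):
                    tx1, ty1 = col * tile_w, row * tile_h
                    tx2, ty2 = tx1 + tile_w, ty1 + tile_h
                    # Overlap between the box and this tile, as a fraction of the tile area
                    ix = max(0.0, min(x2, tx2) - max(x1, tx1))
                    iy = max(0.0, min(y2, ty2) - max(y1, ty1))
                    if (ix * iy) / (tile_w * tile_h) >= min_overlap:
                        selected.append((row, col))
            return selected

        # e.g. a box covering roughly the top-left quadrant of a 300x300 stitched grid
        print(tiles_for_box((10, 10, 160, 160), (300, 300)))
        # [(0, 0), (0, 1), (1, 0), (1, 1)]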

  • mehdibl 14 minutes ago

    Ok and then? Those models were not trained for this purpose.

    It's like the recent hype over using generative AI for trading.

    You might use it for sentiment analysis, summarization, and data pre-processing, but classic forecasting models will outperform it if you feed them the right metrics.
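
    Roughly that split, as a sketch (the sentiment function is just a placeholder for an LLM call, and Ridge stands in for whatever classic model you'd actually use):

        import numpy as np
        from sklearn.linear_model import Ridge  # stand-in for any classic forecaster

        def llm_sentiment(headline: str) -> float:
            # Placeholder: a real version would ask an LLM to score the headline in [-1, 1].
            return 0.0

        def build_features(prices, headlines):
            # Classic metric (lagged log-returns) plus one LLM-derived feature.
            returns = np.diff(np.log(prices))                        # length n-1
            sentiment = np.array([llm_sentiment(h) for h in headlines])[1:]
            X = np.column_stack([returns[:-1], sentiment[:-1]])      # features known at t
            y = returns[1:]                                          # target: next return
            return X, y

        prices = [100, 101, 99, 102, 103, 101]
        headlines = ["h1", "h2", "h3", "h4", "h5", "h6"]
        model = Ridge().fit(*build_features(prices, headlines))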

  • flakiness an hour ago

    To be honest I'm surprised how well it holds up. I expected close-to-total collapse. It'll be a matter of time, I guess, but still.

    • mdahardy an hour ago

      Same! As we talk about in the article, the failures were less from raw model intelligence/ability than from challenges with timing and dynamic interfaces

  • xnx 2 hours ago

    Seems like Google Gemini is tied for the best and is the cheapest way to solve Google's reCAPTCHA.

    Will be interesting to see how Gemini 3 does later this year.

    • mdahardy an hour ago

      After watching hundreds of these runs, Gemini was by far the least frustrating model to observe.

      • dgacmu 20 minutes ago

        In my admittedly limited-domain tests, Gemini did _far_ better at image recognition tasks than any of the other models. (This was about 9 months ago, though, so who knows what the current state of things is.) Google has one of the best internal labeled image datasets, if not the best, and I suspect this is all related.

    • bena an hour ago

      Makes sense, what do you think it was trained on?

  • kjok an hour ago

    If not already, models will get better at solving captchas in the near future. IMHO, the real concern, however, is cheap captcha-solving services.

  • maknee an hour ago

    interesting results. why do reload/cross-tile have worse results? would be nice to see some examples of failed results (how close did it get to solving?)

    • Youden 9 minutes ago

      I'm also curious what the success rates are for humans. Personally I find those two the most bothersome as well. Cross-tile because it's not always clear which parts of the object count and reload because it's so damn slow.

    • mdahardy an hour ago

      We have an example of a failed cross-tile result in the article - the models seem much better at detecting whether something is in an image than at identifying the boundaries of that object. This probably has to do with how they're trained - if you train on description/image pairs, I'm not sure how well that teaches boundaries.

      Reload challenges are hard because of how the agent-action loop works. But the models were pretty good at identifying when a tile contained an item.
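
      A rough sketch of the two framings (ask_vision_model here is just a placeholder for whatever VLM call the agent makes, not our actual harness):

          # 3x3 "select all images with X": one independent yes/no question per tile.
          # This framing played to the models' strengths.
          def solve_select_all(tile_images, target):
              clicks = []
              for i, tile in enumerate(tile_images):
                  answer = ask_vision_model(  # placeholder VLM call
                      image=tile,
                      prompt=f"Does this image contain a {target}? Answer yes or no.",
                  )
                  if answer.strip().lower().startswith("yes"):
                      clicks.append(i)
              return clicks

          # 4x4 cross-tile: one image cut into squares, so the model has to localize
          # the object's boundaries. This is where the models struggled.
          def solve_cross_tile(grid_image, target):
              return ask_vision_model(  # placeholder VLM call
                  image=grid_image,
                  prompt=f"List the (row, col) squares of this 4x4 grid that contain any part of the {target}.",
              )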

  • PaulHoule 2 hours ago

    I know people were solving CAPTCHAS with neural nets (with PHP no less!) back in 2009.

    • golfer an hour ago

      Indeed, the war between captchas and captcha-solving bots has been going on for a long time. Considering all the cybercrime and ubiquitous online fraud today, it's pretty impressive that captchas have held the line as long as they have.

    • mdahardy an hour ago

      You could definitely do better than we do here - this was just a test of how well these general-purpose systems perform out of the box.

  • ajsnigrutin an hour ago

    So, when do we reach the point where AI is better than humans and we remove captchas from pages altogether? If you don't want bots to read content, don't put it online; you're just inconveniencing real people now.

    • cubefox 27 minutes ago

      They can also sign up and post spam/scams. There are a lot of those spam bots on YouTube, and there would probably be a lot more without any bot protection. Another issue is aggressive scrapers effectively DoSing a website. Some defense against bots is necessary.

  • WhereIsTheTruth an hour ago

    3 models only, can we really call that a benchmark?

  • guluarte an hour ago

    in other words, reasoning can fill the context window with crap