Notes on Anthropic's Computer Use Ability

(composio.dev)

79 points | by todsacerdoti a day ago

105 comments

  • acrooks a day ago

    I've built a couple of experiments using it so far and it has been really interesting.

    On one hand, it has really helped me with prototyping incredibly fast.

    On the other, it is prohibitively expensive today. Essentially you pay per click, in some cases per keystroke. I tried to get it to find a flight for me. So it opened the browser, navigated to Google Flights, entered the origin, destination, etc. By the time it saw a price, there had already been more than a dozen LLM calls. And then it crashed due to a rate-limit issue. By the time I got a list of flight recommendations, I had already spent $5.
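
    For anyone wondering why it bills per click: the beta works as a tool-use loop where every single action is a full API round trip carrying a fresh screenshot. A minimal sketch of that loop, reconstructed from Anthropic's beta docs (not their reference code; `execute_action` is a hypothetical stub):

    ```python
    # One API call per action: the model looks at a screenshot, emits one
    # click/keystroke as a tool_use block, and waits for the result. Costs
    # scale with the number of actions, not the number of tasks.
    import anthropic

    client = anthropic.Anthropic()

    tools = [{
        "type": "computer_20241022",   # beta tool type, October 2024
        "name": "computer",
        "display_width_px": 1280,
        "display_height_px": 800,
    }]

    def execute_action(action: dict) -> str:
        # hypothetical stub: a real harness would move the mouse, type, or
        # take a screenshot here and return the result to the model
        return f"(pretend we executed {action})"

    messages = [{"role": "user", "content": "Find a flight from SFO to JFK"}]

    while True:
        response = client.beta.messages.create(
            model="claude-3-5-sonnet-20241022",
            max_tokens=1024,
            tools=tools,
            messages=messages,
            betas=["computer-use-2024-10-22"],
        )
        if response.stop_reason != "tool_use":
            break                      # model is done (or gave up)
        messages.append({"role": "assistant", "content": response.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id,
             "content": execute_action(b.input)}
            for b in response.content if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
    ```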

    But I think this is intended to be an early demo of what will be possible in the future. And they were very explicit that it's a beta: all of this feedback above will help them make it better. Very quickly it will get more efficient, less expensive, more reliable.

    So overall I'm optimistic to see where this goes. There are SO many applications for this once it's working really well.

    • steveBK123 a day ago

      I guess I'm confused that there's even a use case there. It's like "let me google that for you". I mean, Siri can return search results for flights.

      A real killer app would be something that is adaptive and smart enough to deal with all the SEO/walled gardens in the travel search space, actually understanding the airlines available and searching directly there as well as at aggregators. It could also be integrated with your Airline miles accounts and all suggested options to use miles/miles&cash/cash, etc.

      All of that is far more complex than... clicking around Google Flights on your behalf and crashing.

      Further, the real killer app is one that is bulletproof enough that you'd entrust it to book said best flight for you. That requires getting the product to 99.99% reliability rather than the perpetual 70-80% we see all these LLM use cases hit.

      • sithadmin a day ago

        The airline booking + awards redemption use case is a mostly solved problem. Hardcore mileage redemption enthusiasts use paid tools like ExpertFlyer that present a UI and API for peeking into airline reservation backends. It has a steep learning curve, for sure.

        ThePointsGuy blog tried to implement something that directly tied into airline accounts to track mileage/points and redemption options, but I believe they got slapped down by several airlines for unauthorized scraping. Airlines do NOT like third parties having access to frequent flier accounts.

        • acrooks a day ago

          While the strategy to find good deals / award space is a solved problem, the search tools to do so aren't. Tools like ExpertFlyer are super inefficient: they permit at most one origin + one destination + one airline per search. What if you're happy to go anywhere in Western Europe? Or if you want to check several different airlines? Then all of a sudden your one EF search might turn into dozens. And as you say, pretty much all of the aggregator tools are getting slapped down by airlines, so they increasingly have more limited availability and some are shutting down completely.

          And then add the complexity that you might be willing to pay cash if the price is right ... so then you add dozens more searches to that on potentially many websites.

          All of this is "easy" and a solved problem but it's incredibly monotonous. And almost none of these services offer an API, so it's difficult to automate without a browser-based approach. And a travel agent won't work this hard for you. So how amazing would it be instead to tell an AI agent what you want, have it pretend to be you for a few minutes, and get a CSV file in your inbox at the end?

          Whether this could be commercialised is a different question but I'm certainly going to continue building out my prototype to save myself some time (I mean, to be fair, it will probably take more time to build something to do this on my behalf but I think it's time well spent).

          • steveBK123 a day ago

            Yes, I think this points to the need for adaptiveness, which remains humans' edge.

            We don't need PBs of training data, millions in compute, and hours upon hours of training.

            You could sit down a moderately intelligent intern as a mechanical turk to perform this workflow with only a few minutes of instruction and get a reasonably good result.

            • pinko a day ago

              Underrated comment.

            • jaggs a day ago

              Ah, but I think you're overlooking one major factor: convenience. A lot of the spontaneous stuff we do ("hey, why don't we pop down to x tomorrow?", or "do you fancy a quick curry?") is stuff you're not going to book with a Turk. BUT you definitely would fire up a quick agent on your way to the shower and have it do all the work for you while you're waxing your armpits. :) Agentic work is starting super slow, but once the wrinkles are worked out, we'll see a world where agents are doing a huge amount of heavy lifting for the drudge stuff. For an example see Her - sorry! :)

        • steveBK123 a day ago

          Yes, that seems to be the larger challenge. The search tools I have used will work for a while until they don't. Real cat & mouse game.

          Hence the "adaptive" part of my comment.

          It really needs to be a client side agent.

    • danielbln a day ago

      Haiku 3.5 will be here soon, and before long it will support tool use and vision, so that should help a lot with cost.

    • kordlessagain a day ago

      It’s running in the browser but connected to a VM, right? When you say crashed, what did it do?

    • inquisitor27552 a day ago

      time is also a huge factor on this one, should be a nice metric

      god the future is here haha

  • bonoboTP a day ago

    This kind of stuff is an existential threat to ad-based business models and upselling. If users no longer browse the web themselves, you can't show them ads. It's a monumental, Earth-shattering problem for behemoths like Google, but also for normal websites. Lots of websites (such as booking.com) rely on shady practices to mislead users and upsell them. If you have a dispassionate, smart computer agent doing the transaction, it will only buy what's needed to accomplish the task.

    There will be enormous push towards steering these software agents towards similarly shady practices instead of making them act in the true interest of the user. The ads will be built into the weights of the model or something.

    • tracerbulletx a day ago

      Ads will move to the layer of the new interface when that happens. Also a computer can't watch a youtube video for you or look at funny cat pictures. You can still put ads next to things people want to look at.

      • rty32 a day ago

        Care to elaborate on the idea? I suppose you mean that ads will come to this "computer use" tool itself. Now, will users keep it in the foreground, when they already expect the tool to do (almost) everything for them?

        • steveBK123 a day ago

          I think the point is - don't be so naive. Companies are investing near trillions into developing models, training models, compute, datacenter, nuclear reactors, etc.

          Is the endgame some free/cheap tool that abstracts away the entire ad based web economy to the benefit of end users?

          Imagine something closer to a super duper smart useful Siri/Alexa that feeds you product recommendations, paid placement, and other ads interspersed with your actual request response.

          Hey Siri, what temperature is it? It's 45 and going to be chilly today; a North Face jacket might be handy... can I recommend you a few models? What's your size?

          • bonoboTP a day ago

            Or it's just simply going to make purchases and flight bookings based on paid boosts from online stores and airlines. They will simply say that it's making a holistic assessment, not based on the final price alone but on overall reputability, etc. It's to avoid fraud and to streamline the experience based on personalized machine-learning results, yadda yadda.

            I mean, what really are ads and dark tactics (like those observed on accommodation booking websites)? They are ways to influence purchasing decisions.

            If the decision is offloaded to AI, then logically ways to sway the AI decision will be developed. Such as backroom deals, hidden prompts and rules governing the assessment of the AI in making choices.

      • lupusreal a day ago

        Only a matter of time before computers are generating videos of cats cuter than any real cat.

    • kredd a day ago

      Still not a problem for Meta/TikTok/YouTube though, as people go there to consume content on purpose. But I agree, will be fun to see how Google and others will deal with it.

    • thenaturalist a day ago

      > smart computer agent doing the transaction

      None of these agents are smart.

      And if purchases become agentic, fine print or other shady tricks hidden in business terms will be how businesses draw consumers in.

      Also, none of this will be existential, earth-shattering or enormous until compute power per watt improves to the point where all of this is economical at scale.

    • Facemelters a day ago

      the ads will target the latent biases of the agentic AI, just like they do with humans

      • steveBK123 a day ago

        and/or the ad dollars will move into the decision layer and the AI will make different decisions / recommendations to your request, depending on who is bidding the most..

        Imagine the most dystopian outcomes and you'll probably be closer than "well I don't have to see ads anymore!"

    • voytec a day ago

      > This kind of stuff is an existential threat to ad-based business models and upselling.

      Sounds great. But corporations will find a way to fuck over their users for investors' gains in no time.

  • _heimdall a day ago

    I'm all for the MVP approach and shipping quickly, though I'm really surprised they went with image recognition and tooling for injecting mouse/keyboard events for automating human tasks.

    I wonder why leveraging accessibility tools for this wouldn't have been a better option. Browsers and operating systems both have pretty comprehensive tooling for accessibility tools like screen readers, and the whole point of those tools is to act as a middle man to programmatically interpret and interact with what's on screen.

    • danielbln a day ago

      I think the reason is that this is the most general implementation. It doesn't need Playwright or access to the DOM or anything else; if it has a screen and a mouse/keyboard, it will work. That's quite powerful (if slow and pricey, at the moment).

      • _heimdall 16 hours ago

        Unless I'm mistaken, playwright doesn't actually use the accessibility tree directly. It does have quite a few APIs for accessing nodes based on a11y attributes, but I could have sworn those were glorified query selectors rather than directly accessing the accessibility tree.

        Last I checked on it, maybe a year ago, there were browser proposals for standardizing the accessibility tree APIs but they were very early discussions and seemed pretty well stuck.

        That would be a good reason for Anthropic using image processing here though, short of forking open source a11y tools there may not have been a simple way to use accessibility data to interact.
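
        For illustration, here's what those role-based locators look like in the Python API (a minimal sketch; example.com's "More information" link is just a convenient target). They resolve elements by ARIA role and accessible name computed from the DOM, which is why they feel like glorified query selectors:

        ```python
        # Sketch of Playwright's role-based locators (sync API). These match
        # on ARIA role + accessible name derived from the DOM, not by walking
        # the browser's internal accessibility tree.
        from playwright.sync_api import sync_playwright

        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto("https://example.com")   # placeholder URL
            # resolves via role/name semantics, much like a query selector
            page.get_by_role("link", name="More information").click()
            browser.close()
        ```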

    • infecto a day ago

      Those sound like stopgaps at best. The intended goal here is pretty clear: APIs are easy to integrate with, but most systems in existence only have a visual interface intended for humans.

      The end goal is being able to interface with anything available on the screen.

      • ryukafalz a day ago

        Accessibility tools are made for humans. If there is information only available visually and not via a screen reader or other accessibility tools, that is a problem that needs to be addressed.

        • infecto a day ago

          Accessibility tools, I find, are never as good as the source. Just because they are made for humans does not mean they are an improvement. I imagine they are at best a stopgap as image models improve.

        • lupusreal a day ago

          It's like making a robot that can walk up stairs instead of roll up wheelchair ramps or use elevators. It's harder, but more capable.

    • ctoth a day ago

      Hmmm.

      I think I'm a blind user of the late 2024 Internet.

      I use a screen reader for everything.

      Now I'm paranoid that I'm just a computer use model, testing this a11y API hypothesis, in training.

    • elif a day ago

      Crazy that this needs to be said, but 'Computer use' is a far more expansive domain than internet browsing...

      • _heimdall a day ago

        From what I've seen of this new product (I've never used it), it sounds like it's specifically trying to mimic a human user, and they went with image recognition plus faked input devices.

        That approach is a weird one to me, though only as long as it's limited to the current use. If this is just a test bed for a much broader tool that could rely on accessibility APIs, that makes sense.

  • VBprogrammer a day ago

    Well, this just opened up a new phase in the captcha wars.

    • sunilkumardash9 a day ago

      It certainly did; it's like a Pandora's box. Unless they lobotomize it, we can expect Qwen and DeepSeek to release open-source models.

    • jazzyjackson a day ago

      Captchas were already outsourced to cheap labor, maybe 10 or 20 cents a pop? AI using image interpretation is not any cheaper, so the captcha's efficacy is unchanged.

  • belval a day ago

    The product I would like to see out of this is a way to automate UI QA.

    Ideally it would be given a persona and a list of use cases, try to accomplish each task and save the state where you/it failed.

    Something like Chrome Lighthouse, but for usability. Bonus points if it can highlight which parts of my documentation use mismatched terminology, making it difficult for newcomers to understand what button I am referring to.
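
    Nothing like this seems to exist off the shelf, but the harness shape is simple. A hypothetical sketch (`run_agent` and `AgentResult` are stand-ins for whatever computer-use loop you drive, not a real library):

    ```python
    # Hypothetical persona-driven QA harness: run every (persona, use case)
    # pair through an agent and persist the state wherever it failed.
    from dataclasses import dataclass

    @dataclass
    class AgentResult:
        succeeded: bool
        screenshot: bytes = b""

        def save(self, path: str) -> None:
            with open(path, "wb") as f:
                f.write(self.screenshot)

    def run_agent(system: str, task: str) -> AgentResult:
        # stub: a real version would drive a computer-use loop against the app
        return AgentResult(succeeded=False)

    personas = {
        "newcomer": "first visit, reads every label, follows the docs literally",
        "power-user": "keyboard shortcuts, skips onboarding, moves fast",
    }
    use_cases = ["sign up with email", "create a project", "invite a teammate"]

    for name, traits in personas.items():
        for task in use_cases:
            result = run_agent(system=f"You are a {name}: {traits}.", task=task)
            if not result.succeeded:
                # a human still vets these before a ticket gets cut
                result.save(f"fail-{name}-{task.replace(' ', '_')}.png")
    ```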

    • steveBK123 a day ago

      I've seen similar sentiment even pre-LLM that AI would help automate other forms of testing, and I just don't quite see it.

      Implementing tests is not the hard part. You could make that an intern project or hire a consultant for 3 months. The hard part is the interpretation of results.

      That is - making a thing that spits out tickets/alerts is easy. The signal/noise tuning and actual investigation workflows are the hard part and still very manual & human operated. I don't see LLM mouse/keyboard control changing that yet.

      • belval a day ago

        > making a thing that spits out tickets/alerts is easy.

        I don't really believe that what I am asking for is hard, yet I still can't buy it as far as I know.

        > actual investigation workflows are the hard part and still very manual & human operated.

        Sure, but it would give your QA worker a set of pre-tested, use-case-based paths, each flagged as potentially problematic, with a screen recording and a timestamp of where it went wrong.

        These will always need a human in the loop to vet the findings before cutting a ticket to the development team.

        • steveBK123 a day ago

          Fair - I'm not personally familiar with the state of the art in UI QA automation, but I know there have been various screen-recording-type tools available for a decade+ with mixed success.

          I come more from a "big data" background, and have dealt with CTOs who think "can't we just use AI?" is the answer to data quality checking multi-PB data lakes with 1000s of unique datasets from 100s of vendors. That is - they don't want to staff a data quality team, they think you can just magic it all away.

          The answer was always - sure, but you are fixated on the easy part: anomaly detection. Actual data analysis on what broke, when, how, why, and escalating to the data provider was always 95% of the work. Someone needs to look at the exhaust, and there will be exhaust every single day... so you can kill your dev team's productivity or actually staff an operations team responsible for the tickets the thing spits out.

          • belval a day ago

            That's fair and I don't think I have a good counter to this, it would be very easy for such a UI QA product to become just another "security vulnerability scanner" that cuts low severity tickets that nobody looks at.

  • imranq a day ago

    This is basically RPA with LLMs. And RPA is basically the worst possible solution to any problem.

    Agents won't get anywhere, because any user process you want to automate is better done by creating APIs and a proper, guaranteed interface. Any automated "computer use" will always be a one-off, absurdly expensive, and completely impractical.

    • danielbln a day ago

      There is plenty of legacy software out there that has no nice API to integrate with and never will. Those are the situations where the terrible options are either letting a human do it or automating the human toolchain from a high level. This is the LLM spin on that. Is it an efficient or even good solution? Hell no, but if there is no other route to automation (assuming that's the goal), does that matter?

      • alpha_squared a day ago

        This is a severely under-appreciated perspective. A lot of software, especially in industries that are slow to change, is just not programming-friendly. There are no APIs and no access to underlying databases, just user-focused point-and-click.

        • rty32 a day ago

        My take is that those industries are also going to be very slow to adopt any AI tools, especially these, and for good reasons. We are looking at integrating LLMs into our products, but we have customers who told us flat out they can't use any of those.

      • imranq a day ago

        I'd argue that this is not even a solution to begin with. If the LLM gets even one pixel value wrong, then at best the whole process breaks down. At worst, you could do some irreversible damage.

    • supriyo-biswas a day ago

      I could see this coming to Apple Intelligence, for example; you could simply ask the browser to buy stuff from your favorite store, or even do a chain of tasks like informing a contact from your list that you've bought said thing, etc.

      The possibilities are quite exciting, in fact, even though the technology isn't quite there yet.

      • imranq a day ago

        Apple should hook into app functions themselves instead of relying on UI. I would be really surprised if Apple made a browser automation tool, since that would be the complete opposite of the "it just works" credo

  • nilstycho a day ago

    It seems like a cheaper intermediate capability would be to give Claude the ability to SSH to your computer or to a cloud container. That would unlock a lot of possibilities, without incurring the cost of the vision model or the difficulty of cursor manipulation.

    Does this already exist? If not, would the benefits be lower than I think, or would the costs be higher than I think?
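
    For what it's worth, the same beta already ships a shell-only tool type (`bash_20241022`) alongside the screen one, so a terminal-only agent is already expressible without the vision cost. A minimal sketch based on the beta docs (the executor below is deliberately naive; a real harness would sandbox it):

    ```python
    # Shell-only tool use: the model emits commands, you run them and return
    # the output. No screenshots, so no vision-model cost.
    import subprocess
    import anthropic

    client = anthropic.Anthropic()
    tools = [{"type": "bash_20241022", "name": "bash"}]

    response = client.beta.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "How much free disk space is there?"}],
        betas=["computer-use-2024-10-22"],
    )

    for block in response.content:
        if block.type == "tool_use":
            # DANGER: runs model-supplied shell commands verbatim; confine
            # this to a throwaway container or VM
            out = subprocess.run(block.input["command"], shell=True,
                                 capture_output=True, text=True)
            print(out.stdout or out.stderr)
    ```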

    • reportgunner a day ago

      Not just the benefits and costs but also the risks are to be considered here, I think.

      • nilstycho a day ago

        What are the risks? Isn't this a strict subset of the risks of full desktop access? Claude can just open a GUI terminal with Computer Use. (I think.)

        • reportgunner a day ago

          Software posing as Claude that is actually malware, tricking an unsuspecting non-terminal user into executing it, is what I was thinking about.

    • pier25 a day ago

      I would never allow an AI to SSH into a server.

      Just the other day someone used Claude to write a script to configure a server. It left a port open and the server was hacked hours later and used to attack other servers. Hetzner almost banned the hosting account.

      https://x.com/rameerez/status/1848707234068382001

      • jazzyjackson a day ago

        I had GPT-4o walk me through configuring my RAID array, a simple two-drive duplication affair, and some command broke the configuration in new and mysterious ways - I can no longer get the drives to appear at all. So that will be the last time I copy-paste anything from an AI into a shell.

      • nilstycho a day ago

        IIUC, Claude's "Computer Use" is roughly a remote desktop, which is a superset of a remote shell. I don't think I'm proposing anything with a greater risk than already exists.

    • kordlessagain a day ago

      I’m working on Webwright which presents as a shell. It’s on GitHub.

      • nilstycho a day ago

        Based on the flow diagram, that doesn't seem to be the same thing. Webwright seems to be a shell as a tool for me, enhanced with AI features. I'm suggesting the shell as a tool for AI.

        Webwright is a front-end shell that presents to me; I'm suggesting a back-end shell that presents to Claude.

        It doesn't appear that Webwright enables tool-use. In other words, there's no task-oriented feedback loop between AI-provided shell commands and the results of those shell commands. Please correct me if that's not right.

    • ActionHank a day ago

      Anecdotal, but I think if you mention it in any discussion in a corporate setting, alarm bells will go off, because HACKERS use SSH.

      This _seems_ more like a normal user so clearly could not do anything nefarious. /s

  • elif a day ago

    If its only downside is cost, and the cost is prohibitive for all practical uses, why didn't this project start with https://huggingface.co/meta-llama/Llama-3.2-11B-Vision?

  • Jayakumark a day ago

    Any idea how Sonnet does this? Is the image annotated with bounding boxes on text boxes etc., along with their coordinates, before being sent to Sonnet, and does it respond with a box name or a coordinate? Or is SAM2 used to segment everything before sending to Sonnet?

  • cl42 a day ago

    I really, really like this new product/API offering. Still crashes quite a bit for me and obviously makes mistakes, but shows what's possible.

    For the folks who are more savvy on the Docker / Linux front...

    1. Did Anthropic have to write its own "control" for the mouse and keyboard? I've tried using `xdotool` and related things in the past and they were very unreliable. (See the sketch at the end of this comment.)

    2. I don't want to dismiss the power and innovation going into this model, but...

    (a) Why didn't Adept or someone else focused on RPA build this?

    (b) How much of this is standard image recognition and fine-tuning a vision model to a screen, versus something more fundamental?
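
    On (1): as far as I can tell, Anthropic's reference demo drives a virtual X display with plain `xdotool` rather than custom input code. A rough sketch of that layer (my own wrapper, not theirs; the coordinates and text are arbitrary, and it assumes an X11 DISPLAY):

    ```python
    # Thin xdotool wrapper: shells out for each input event and fails
    # loudly on error. This is the kind of layer a computer-use harness
    # sits on top of.
    import subprocess

    def xdo(*args: str) -> None:
        subprocess.run(["xdotool", *args], check=True)

    xdo("mousemove", "640", "400")          # move pointer to (640, 400)
    xdo("click", "1")                       # left-click
    xdo("type", "--delay", "50", "hello")   # type with 50 ms between keys
    xdo("key", "Return")                    # press Enter
    ```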

  • ko_pivot a day ago

    At the end of the day, the fundamental dynamic here is human creativity. We are taking a tool, the LLM, and stretching it to its limit. That’s great, but that doesn’t mean we are close to AGI. It means we are AGI.

    • macawfish a day ago

      This is an insightful comment, though it just goes to show how rigid the framing is of "natural vs. artificial" or "human vs. machine". None of this stuff has any vitality outside of _some_ relationship or interface with people.

      • bitwize a day ago

        Yeah, it makes the owner class richer while driving the marginal cost of labor to zero, at which point the working class can't sell their labor at all and starve.

        • bamboozled a day ago

          This would assume the rich somehow oppress everyone to pieces. If I have access to all this wonderful automation tech, I'm sure as fuck not going to sit around and starve; I'm going to try to automate my food production to make more food, more efficiently?

          • computerthings a day ago

            > If I have access to all this wonderful automation tech

            But "you" don't, that is precisely the point. The speed at which the gap between rich and poor grows keeps increasing, after all -- the rest is commentary --, and people who right now send people to die and murder in wars for oil, and what not not, will not suddenly start sharing when they fully captured all means of production for good. That's like hoping the person who keeps stealing your shit every chance he has, leaving you in sickness or death without a thought will give you a billion dollars once all the locks on your house have rusted off completely and you no longer have means to call the police.

    • sunilkumardash9 a day ago

      This is a step towards a human-machine hybrid world. Putting a human in the loop can do wonders. Sure, it is expensive now, but the subsequent iterations will crush it.

      • im3w1l a day ago

        Have you heard of Centaur chess? A human and a machine would team up to find the best chess moves against another similar team. It's not a thing anymore. Computers have advanced so much that humans can't really contribute in any meaningful sense.

        • steveBK123 a day ago

          All these AI models do quite well in games because there are set rules, finite moves, and they can iterate in a tight loop (without humans) to get immediate feedback on pass/fail.

          I think this is what differentiates the speed at which AIs have gotten from ok -> good -> great -> better than humans at say chess, versus say driving a car, summarizing a paper, understanding human requests, recommending music, etc.

          I think a lot of people are extrapolating the rate of progress & possible accuracy rates from chess bots to domains that do not compare.

          • pksebben a day ago

            Would've said that about writing and text, about three years ago.

        • tsunamifury a day ago

          Once we realize we can make machines that can beat us in ways we can't even understand, I wonder if we will question whether we have always been influenced this way by an exterior force.

          • bamboozled a day ago

            Sounds like an interesting idea. Do you mean, like, the concept of "fate" is the type of external force you describe?

            • tsunamifury 5 hours ago

              Yeah, something like that, or like the Bene Gesserit from Dune.

        • bamboozled a day ago

          Is the point of your comment to make people feel depressed?

          Either we're going to use these tools to augment our abilities or basically just get wiped out (at least our jobs will be), and there is no plan to provide support for anyone. Maybe the tech will make the transition to a post-employment world so swift we don't even feel any negative economic effects at all, but let's see.

          • im3w1l a day ago

            There is no such bigger point. I'm just trying to look at the situation realistically.

          • exe34 a day ago

            is this a cry for help? there's always alcohol and drugs, they can't take that away from us!

            (unless the robots of the future are like Bender)

            • bamboozled a day ago

              I'm not saying I am depressed, but I mean, the comment just sounded like such a major downer.

              • exe34 a day ago

                reality often is, unfortunately.

                • bamboozled a day ago

                  Depressing hasn’t been the reality for the majority of people over the last 100 years of technological progress. You could die from a scratch or a kidney stone 100s ago.

                  Maybe this is the cliff , but it feels unlikely.

                  • exe34 a day ago

                    These things weren't depressing back then; they were normal. You can be depressed anywhere or anywhen, and similarly happy in any circumstances.

  • alt-glitch a day ago

    I wonder if I can hook up `scrcpy` with this and give it control over an Android. Can it drag the mouse? That'd be needed to navigate the phone at least.
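
    One route that sidesteps scrcpy's own control protocol entirely: keep scrcpy (or `adb exec-out screencap`) for the pixels, but inject input with plain adb, which does support drags via swipe. A sketch, assuming a 1080x1920 display and a device already connected over adb:

    ```python
    # Drive an Android device with adb input events while scrcpy (or
    # screencap) supplies the screenshots for the model to look at.
    import subprocess

    def adb_input(*args: str) -> None:
        subprocess.run(["adb", "shell", "input", *args], check=True)

    adb_input("tap", "540", "960")                          # tap screen centre
    adb_input("swipe", "540", "1500", "540", "500", "300")  # drag/scroll, 300 ms
    adb_input("text", "hello")                              # type into focused field
    ```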

    • elif a day ago

      For my home automation system, my aspiration is to give the AI control of an Android virtual machine.

  • azinman2 a day ago

    I saw this demoed yesterday. The model was asked to create a cool visualization. It ultimately tried to install streamlit and go to its page, only to find its own Claude software already running streamlit, so as part of debugging it killed itself. Not ready to let that go wild on my own computer!

  • namanyayg a day ago

    What are some good use cases for this? Something that a business could be built around.

    • nerdix a day ago

      Its main use case is making the average office worker feel the same existential dread that some programmers feel when they see an LLM spit out a bunch of code in mere seconds.

      • jampekka a day ago

        And when having to pick up the pieces when someone actually uses the thing.

      • lagniappe a day ago

        I don't normally go for comedy on HN but this one got an audible chuckle.

        TBH, while I giggle at the thought of anybody being replaced, I dont think it's likely, it's just that the standards and expectations have shifted in some domains. I think if anything LLM's raised the tide for everyone (in relevant roles) and we're all able to move a little faster now, like when we went from abacus to calculator a while back, just a different scale of magnitude.

      • Slump a day ago

        You're not far off. Anecdotal, but I shared the Anthropic demo video and a few articles in a company Slack, and a lot of PMs/admin folks who are only tangentially aware of LLM-powered use cases at this point shared that sentiment. Welcome to the party, folks!

    • bicijay a day ago

      A lot of automation around software without APIs.

      QA...

    • conception a day ago

      There are dozens of RPA businesses: UiPath, etc.

      We had one that was a simple download from here and login and upload there. Having the accounting team be able to automate that versus devops is huge.

    • infecto a day ago

      Have you seen UiPath? The enterprise automation business is massive.

  • sys32768 a day ago

    So robotic process automation gains intelligence and we can train an AI intern to assist with tasks.

    Our own personal digital squire.

    Then eventually we become assistants to AI.

  • guzik a day ago

    I’m not sure if anyone else has really tried, but I’ve tested it a few times and never hit meaningful results.

    1) I tried using it for QA for my SaaS, but the agent failed multiple times to fill out a simple form, ending with it saying the task was successfully completed.

    2) It couldn’t scrape contact information from a website where the details weren’t even that hidden.

    3) I also tried sending a message on Discord, but it refused, saying it couldn’t do so on someone else's behalf.

    I mean, I’m excited for what the future holds, but right now, it’s not even in beta.

  • sheepscreek a day ago

    Appreciate the TL;DR as I got what I was looking for. Burning $30 for just trying it out doesn’t make it sound so promising at the moment.

  • Karthikeya a day ago

    I didn't see anyone actually going into production with any of this stuff. Man, this hype cycle just continues.

    • Kiro a day ago

      What is "this stuff"? Computer use was released 3 days ago and I would say the opposite is true for LLMs in general: it's overused in production and shoehorned into stuff that doesn't need it.

  • martythemaniak a day ago

    I've been hacking on a web browsing agent the last few weeks and it's given me some decent understanding of what it'd take to get this working. My approach has been to make it general-purpose enough that I describe the mechanics of surfing the web without building in specific knowledge about tasks or websites. Some things I've learned:

    1. Pixels and screenshots (video, really) and keyboard/mouse events are definitely the purest and most proper way to get agents working in the long term, but it's not practical today. Cost and speed are big obvious issues, but accuracy is also low. I found that GPT-4o (08-06) is just plain bad at coordinates and bounding boxes, and naively feeding it screenshots just doesn't work. As a practical example, another comment mentions trying to get a list of flight recommendations from Claude computer use and it costing $5; if my agent is up to that task (I haven't tested this), it would cost $0.10-$0.25.

    2. "feature engineering" helps a lot right now. Explicitly highlighting things and giving the model extra context and instructions on how to use that context, how to augment the info it sees on screenshots etc. It's hard to understand things like hover text, show/hide buttons, etc from pure pixels.

    3. You have to heavily constrain and prompt the model to get it to do the right thing now, but when it does it, it feels magic.

    4. It makes naive, but quite understandable mistakes. The kinds of mistakes a novice user might make and it seems really hard to get this working. A mechanism to correct itself and learn is probably the better approach rather than trying to make it work right from the get-go in every situation. Again, when you see the agent fail, try again and succeed the second time based on the failure of the previous action, it's pretty magical. The first time it achieved its objective, I just started laughing out loud. I don't know if I've ever laughed at a program I've written before.
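
    To make point 2 concrete, here's the generic shape of the element-highlighting trick (a sketch, not my actual code): overlay numbered badges on interactive elements, screenshot the annotated page, and let the model answer with a label instead of guessing pixel coordinates.

    ```python
    # Generic "highlight the clickable things" sketch: tag interactive
    # elements with numbered badges, screenshot that, and resolve the
    # model's answer ("click 0") via a data attribute.
    from playwright.sync_api import sync_playwright

    LABEL_JS = """
    () => {
      const els = document.querySelectorAll('a, button, input, select, textarea');
      els.forEach((el, i) => {
        el.dataset.agentId = i;                      // stable handle for clicking
        const tag = document.createElement('div');   // visible numeric badge
        tag.textContent = i;
        const r = el.getBoundingClientRect();
        tag.style.cssText =
          'position:absolute;background:yellow;font-size:12px;z-index:99999;' +
          `left:${r.left + window.scrollX}px;top:${r.top + window.scrollY}px;`;
        document.body.appendChild(tag);
      });
      return els.length;
    }
    """

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto("https://example.com")      # placeholder URL
        n = page.evaluate(LABEL_JS)           # badge every interactive element
        page.screenshot(path="labeled.png")   # this image goes to the model
        # suppose the model replies "click 0": resolve via the data attribute
        page.locator('[data-agent-id="0"]').click()
        browser.close()
    ```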

    It's been very interesting working on this. If traditional software is like building legos, this one is more like training a puppy. Different, but still fun. I also wonder how temporary this type of work is, I'm clearly doing a lot of manual work to augment the model's many weaknesses, but also models will get substantially better. At the same time, I can definitely see useful, practical computer use from model improvements being 2-3 years away.

  • utkarsh-dixit a day ago

    This is such an idiotic hype cycle; they just fine-tuned a model over a vision API. I really don't understand why everyone is losing their mind over this.

    • soham123 a day ago

      They showed a really cool, literal example of what's coming. It's almost a ChatGPT-like moment.

      • diffeomorphism a day ago

        Which one? The article has four examples, none of which are particularly "cool" or impressive.

        If anything, the examples involving moving the mouse to the address bar or getting CSVs of results are very poor examples, because we can already do that much better without "computer use".

    • Mistletoe a day ago

      Because this is the last thing we can think of right now, and after this is an abyss for the stock market that everyone knows is inevitable but thinks we can avoid.