Is it decided then that screenshots are better input for LLMs than HTML, or is that still an active area of investigation? I see that y'all elected for a mostly screenshot-based approach here, wondering if that was based on evidence or just a working theory.
Not sure, I think there is a lot of research being done here.
Actually, browser-use works quite well with vision turned off; it just sometimes gets stuck on trivial visual tasks. The interesting thing is that the screenshot approach is often cheaper than cleaned-up HTML, because some websites have HUGE action spaces.
We looked at some papers (like Ferret-UI), but I think we can do much better on HTML tasks. Also, there is a lot of room to improve the current pipeline.
Would be really cool if you could tie this into Claude's computer use APIs!
Do you think they do any super fancy magic beyond, for example, how Ferret-UI does its classification of UI elements? It could be very interesting to test head to head how much better you can make computer use by adding HTML (it's much better from our quick testing, we just don't know the numbers).
I am one of the authors of this open source project Agent-E. I would love it if people here gave it a go:
https://github.com/EmergenceAI/Agent-E
It doesn't have any vision yet, all clever DOM manipulation.
We have a Discord for feedback and welcome contributions to make it better.
Screenshots aren't as accurate or context-rich as HTML, but they let you bypass the hassle of building logic for permissions and authentication across different apps to pull in text content for the LLM.
Can't you just make a browser extension to have access to the HTML and CSS, and use LLMs from there?
Context length + API cost are right now the main bottleneck for huge HTML + CSS files. The extraction here is already quite efficient, but still: with past messages + system prompt + sometimes extracted text + extracted interactive elements you are quickly at around 2,500 tokens (about $0.01 for gpt-4o).
If you extract the entire HTML and CSS, your cost + inference time are quickly 10x.
Aren't screenshots far larger than this?
Nope: a 1280x1024 screenshot at low resolution is 85 tokens with gpt-4o, so approx $0.0002 (roughly 100x cheaper). At high resolution it's approx $0.002.
https://openai.com/api/pricing/
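That arithmetic roughly checks out under OpenAI's published image-token scheme. A back-of-envelope sketch, with the per-token price as an assumption (check the pricing page for current rates):

import math

PRICE_PER_INPUT_TOKEN = 2.50 / 1_000_000  # assumed gpt-4o input price in USD

def image_tokens(width, height, detail="low"):
    # Low detail is a flat 85 tokens; high detail adds 170 tokens per 512px
    # tile after scaling the image to fit 2048x2048 and its short side to 768.
    if detail == "low":
        return 85
    scale = min(2048 / max(width, height), 1.0)
    width, height = width * scale, height * scale
    scale = min(768 / min(width, height), 1.0)
    width, height = width * scale, height * scale
    return 85 + 170 * math.ceil(width / 512) * math.ceil(height / 512)

for detail in ("low", "high"):
    tok = image_tokens(1280, 1024, detail)
    print(f"{detail}: ~{tok} tokens, ~${tok * PRICE_PER_INPUT_TOKEN:.4f}")

# Compare with the ~2,500 text tokens mentioned above for history + extracted elements:
print(f"text: ~2500 tokens, ~${2500 * PRICE_PER_INPUT_TOKEN:.4f}")

A 1280x1024 image comes out to about 85 tokens at low detail and roughly 765 at high detail, so even the high-detail screenshot is cheaper than a large cleaned-up DOM.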
I do this for my extension [0], but the HTML is often too large for context window sizes. I end up scraping out the relevant pieces before sending them to the LLM.
[0] https://chromewebstore.google.com/detail/namebrand-check-for...
The next step is to also represent the structure of the HTML tree in the extracted elements for better understanding; maybe images will then be needed less.
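A minimal sketch of one way to do that (illustrative only, not the project's actual extraction code): serialize each interactive element together with its ancestor chain, so the model sees where it sits in the tree.

from bs4 import BeautifulSoup

INTERACTIVE = ["a", "button", "input", "select", "textarea"]

def outline_interactive(html: str) -> str:
    # One line per interactive element: index, ancestor path, tag, visible label.
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for i, el in enumerate(soup.find_all(INTERACTIVE)):
        path = " > ".join(
            p.name + (f"#{p.get('id')}" if p.get("id") else "")
            for p in reversed(list(el.parents))
            if p.name and p.name != "[document]"
        )
        label = el.get_text(strip=True) or el.get("placeholder") or el.get("value") or ""
        lines.append(f"[{i}] {path} > <{el.name}> {label!r}")
    return "\n".join(lines)

print(outline_interactive(
    "<body><div id='nav'><a href='/docs'>Docs</a></div>"
    "<form><input placeholder='Search'/></form></body>"
))

The result is still a flat, cheap list of candidate actions, but the ancestor path carries most of the tree structure a screenshot would otherwise provide.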
I doubt screenshots would be better input, considering that e.g. <select> box options and other markup are hidden visually until a user interacts with something.
Per research across companies, both help; screenshots are worse, but only marginally.
The computer use stuff gets me fired up enough that I end up always sharing this, even though when delivered concisely without breaking NDAs, it can sound like a hot take:
The whole thing is a dead end.
I saw internal work at a FAANG on this for years, and even in the case where the demo is cooked up to "get everything right", intentionally, to figure out the value of investing in chasing this further... it's undesirable, for design reasons.
It's easy to imagine being wowed by the computer doing something itself, but when it's us, it's a boring and slow way to get things done that's scary to watch.
Even with the staged 100% success rate, our meat-brains cheerily simulating the knowledge that it's < 100% in reality, the fear is akin to watching a toddler a month into walking, except the toddler has your credit card, a web browser, and instructions to buy a ticket.
I humbly and strongly suggest that anyone interested in this space work towards CLI versions of this concept. Now you're non-blocking, you're in a more "native" environment for the LLM, and you're much cheaper.
If that sounds regressive and hardheaded, Microsoft, in particular, has plenty of research on this subject, and there's a good amount from diverse sources.
Note the 20%-40% success rates they report, then note that completing a full task successfully means multiplying a series of 20%-40% probabilities together. To get an intuition for how this affects the design experience, think how annoying it is to have to repeat a question because Siri/Assistant/whatever voice assistant doesn't understand it, and those have roughly ~5 errors per 100 words.
Could you elaborate on the CLI idea? I am intrigued but not exactly sure what you mean.
(handwaving) I'd rather be in a loop of "here's our goal. here's latest output from the CLI. what do we type into the CLI" than the GUI version of that loop.
I hope that's clearer, I'm a bit over-caffeinated
Hmm, but isn't this how we handle it? We just have a CLI that outputs exactly that: goal, state, and it asks the user for more clarity if needed; no GUI. The original idea was to make it completely headless.
I'm sorry I'm definitely off today, and am missing it, I appreciate your patience.
I'm thinking maybe the goal/state stuff might have clouded my point. Setting aside prompt engineering, just thinking of the stock AI UIs today, i.e. chat based.
Then, we want to accomplish some goal using GUI and/or CLI. Given the premise that I'd avoid GUI automation, why am I saying CLI is the way to go?
A toy example: let's say the user says "get my current IP".
If our agent is GUI-based, maybe it does: open Chrome > type in whatismyip.com > recognize IP from screenshot.
If our agent is CLI-based, maybe it does: run the curl command to fetch the user's IP from a public API (e.g. curl whatismyip.com) > parse the output to extract the IP address > return the IP address to the user as text.
In the CLI example, the agent interacts with the system using native commands (in this case, curl) and text outputs, rather than trying to simulate GUI actions and parse screenshot contents.
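A minimal sketch of that loop (ask_llm is a hypothetical helper around whatever chat-completion API you use; a real agent would also need sandboxing, timeouts, and user confirmation before running anything):

import subprocess

def ask_llm(goal: str, history: list[str]) -> str:
    # Return the next shell command for `goal`, or "DONE: <answer>" to stop.
    raise NotImplementedError  # plug in your LLM call here

def run_cli_agent(goal: str, max_steps: int = 10) -> str:
    history: list[str] = []
    for _ in range(max_steps):
        command = ask_llm(goal, history)
        if command.startswith("DONE:"):
            return command.removeprefix("DONE:").strip()
        result = subprocess.run(command, shell=True, capture_output=True,
                                text=True, timeout=30)
        history.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return "gave up after max_steps"

# e.g. run_cli_agent("get my current IP") might issue a curl to a what-is-my-IP
# service and return the parsed address as text.

The agent only ever sees and produces text, which is exactly the loop described above: goal, latest CLI output, next command.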
Why do I believe that's preferable to GUI-based automation?
1. More direct/efficient - no need for browser launching, screenshot processing, etc.
2. More reliable - dealing with only structured text output, rather than trying to parse visual elements
3. Parallelizable: I can have N CLI shells, but only 1 GUI shell, which is shared with the user.
4. In practice, I'm basing that off observations of the GUI-automation project I mentioned, accepting computer automation is desirable, and...work I did to build an end-to-end testing framework for devices paired to phones, both iOS and Android.
What the? Where did that come from?
TL;DR: I have loved E2E tests for years, and it was stultifying to see how little they were used beyond the testing team due to flakiness. Even small things like "launch the browser" are extremely fraught. How long do we wait? How often do we poll? How do we deal with some dialog appearing in front of the app? How do we deal with not having the textual view hierarchy for the entire OS?
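Most of those questions end up baked into some poll-with-timeout helper; a generic sketch (the defaults and browser_window_is_visible are made-up placeholders):

import time

def wait_until(predicate, timeout: float = 30.0, interval: float = 0.5) -> bool:
    # Poll `predicate` until it returns True or `timeout` seconds elapse.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# e.g. wait_until(lambda: browser_window_is_visible(), timeout=60)
# and you still have to pick numbers for every one of the questions above.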
Try Open Interpreter for CLI computer automation.
Nice, thanks
Awesome project, starred! Here are some other projects for agentic browser interactions:
* Cerebellum (Typescript): https://github.com/theredsix/cerebellum
* Skyvern: https://github.com/Skyvern-AI/skyvern
Disclaimer: I am the author of Cerebellum
Thanks man, starred yours too, it's super cool to see all these projects getting spun up!
I see Cerebellum is vision only. Did you try adding HTML + screenshot? I think that improves the performance like crazy and you don't have to use Claude only.
Just saw Skyvern today on previous Show HNs haha :)
I had an older version that used simplified HTML, and it got to decent performance with GPT-4o and Gemini but at the cost of 10x token usage. You are right, identifying the interactable elements and pulling out their values into a prompt structure to explicitly allow the next actions can boost performance, especially if done with grammar like structured outputs or guidance-llm. However, I saw that Claude had similar levels of performance with pure vision, and I felt that vision + more training would beat a specialized DOM algorithm due to "the bitter lesson".
BTW I really like your handling of browser tabs, I think it's really clever.
Fair, also Claude probably only gets better on this since they kinda want people to use Computer use. We are gonna try to do best of both worlds.
Thanks man, Magnus came up with it this morning haha!
It's impressive, but to me it seems like the saddest development experience...
agent = Agent(
    task='Go to hackernews on show hn and give me top 10 post titles, their points and hours. Calculate for each the ratio of points per hour.',
    llm=ChatOpenAI(model='gpt-4o'),
)
await agent.run()
Passing prompts to an LLM agent... waiting for the black box to run and do something...
I mean, is that really much different than an API? I pass a query, and get data back, and rarely do I get to inspect the mechanisms behind what's returning that data.
Let's say in 1 year, more agents than humans interact with the web.
Do you think:
1. Websites release more API functions for agents to interact with them, or
2. We transform the UI with tools like this into functions callable by agents, and maybe even cache all the inferred functions for websites in a third-party service?
It is called screen scraping: text rendered on a screen/monitor is scraped, whether in the browser, in Windows, or even on an Android screen. That's how software like AutoHotkey does automation. A Windows or Android screen can be dumped into a hierarchical XML along with the x/y coordinates of its UI elements and the text they contain, which can then be used to click, scroll, and scrape text.
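For the Android case, a rough sketch of consuming such a dump (assuming the XML produced by uiautomator dump, whose nodes carry text, clickable, and bounds attributes):

import re
import xml.etree.ElementTree as ET

def clickable_elements(dump_path: str):
    # Yield each clickable node's label and the centre point you would tap.
    # Assumes the usual bounds format "[x1,y1][x2,y2]".
    root = ET.parse(dump_path).getroot()
    for node in root.iter("node"):
        if node.get("clickable") != "true":
            continue
        x1, y1, x2, y2 = map(int, re.findall(r"\d+", node.get("bounds", "")))
        yield {
            "text": node.get("text") or node.get("content-desc") or "",
            "center": ((x1 + x2) // 2, (y1 + y2) // 2),
        }

# e.g. after `adb shell uiautomator dump /sdcard/ui.xml`:
#   for el in clickable_elements("ui.xml"): print(el)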
It would be amazing if:
a) There were a test / eval suite to determine which model works best for what. It could be divided into a training suite and test suite. (Training tasks can be used for training, test tasks only for evaluation.) Possibly a combination of unit tests against known xpaths, and integration tests that are multi-step and end in a measurable result. I know the web is constantly changing, so I'm not 100% sure how this should work.
b) There were some sort of wiki, or perhaps another repo or discussion board, of community-generated prompt recipes for particular actions.
A) We plan on thoroughly testing that with the Mind2Web dataset. They have a very robust set of (persistent) selectors.
B) so, shadcn for prompts for web agents haha :) but I agree, that would be SICK! Just go to browseruse and get the prompt for your specific use case
A) For Mind2Web: since there are multiple ways to reach a goal state, any thoughts on how to evaluate whether a task was successful? Should we let the LLM / another LLM evaluate it?
Maybe you could build a database of which sites/pages work best with HTML vs. screenshots, and then choose to use HTML to save on token cost / improve latency where possible.
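A toy sketch of what that lookup might look like (the domains and defaults here are invented examples, not measured data):

from urllib.parse import urlparse

# Per-domain preference: cheap HTML extraction where it is known to work,
# screenshots everywhere else. Entries are illustrative only.
PREFERRED_INPUT = {
    "news.ycombinator.com": "html",   # simple, stable markup
    "maps.google.com": "screenshot",  # canvas-heavy, little useful DOM
}

def choose_input(url: str, default: str = "screenshot") -> str:
    return PREFERRED_INPUT.get(urlparse(url).netloc, default)

# choose_input("https://news.ycombinator.com/show") -> "html"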
This looks interesting. I am really impressed with MultiOn [0], and I tried to make something similar, but it's quite challenging doing it with a Chrome extension.
I also saw one doing Captcha solving with Selenium [1].
I will keep an eye on your development, good luck!
[0] https://www.multion.ai/
[1] https://github.com/VRSEN/agency-swarm
I actually built a Chrome extension that runs Claude computer use if you’d like to try it out! [0] It’s currently awaiting approval in the Chrome Web Store.
After having spent the last several years building a popular Chrome extension for browser automation [1], I was excited to see if LLMs could actually build automations end-to-end based on a high-level description. Unfortunately, they still get confused quite easily so the holy grail has yet to come. Still fun to play around with though!
[0] https://autobrowser.ai/
[1] https://news.ycombinator.com/item?id=29254147
Thanks! Have you tried captcha solving with [1]? It's very tricky sometimes, especially with non-standard "verify human" checks - maybe you could solve it by writing Selenium/JavaScript code directly and then executing it.
I haven't, but I watched a video of it being done with this framework.
With captchas, the worst-case scenario is using a solving service as part of the agent flow. See the 2Captcha service.
Will def try it.
what are the challenges with the Chrome extension path?
You need to call an API to screenshot the page, then figure out the JavaScript code to execute. It's not as easy as it might sound.
Playwright and selenium automate the browser itself, but with the chrome extension you need to use the context of the current browser.
I'm not an expert in browser automation, so I found it challenging to move from Playwright to making it completely browser-based.
I don't know a lot about this, but do you have the full power of Selenium or not? That would also be a very interesting approach, especially when "local" browser models get very good.
From 3 days of playing around with it, I couldn't find a way to use Selenium or Playwright in the browser.
What I did, though, is have a loop that sends instructions from Playwright.
For instance, I open the browser and then enter a loop awaiting instructions (which can come from an event source such as Redis) to execute in the same browser. But still, it's based on the session instantiated by Playwright.
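A rough sketch of that pattern (the Redis list name and instruction format are invented, and a real version needs error handling):

import json
import redis
from playwright.sync_api import sync_playwright

r = redis.Redis()

with sync_playwright() as p:
    # One long-lived browser session; instructions arrive over a Redis list.
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    while True:
        _, raw = r.blpop("browser_instructions")  # block until an instruction arrives
        cmd = json.loads(raw)                     # e.g. {"action": "goto", "url": "..."}
        if cmd["action"] == "goto":
            page.goto(cmd["url"])
        elif cmd["action"] == "click":
            page.click(cmd["selector"])
        elif cmd["action"] == "quit":
            break
    browser.close()

Because the page object lives across loop iterations, every instruction executes in the same session that Playwright instantiated, which is the behaviour described above.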
I have built something similar at https://github.com/ComposioHQ/composio/tree/master/python/co...
Compatible with any LLM and agentic framework.
Looks nice. I find the HTML-cleaning step in our pipeline extremely important; otherwise there is no real benefit over just using a general vision model and clicking coordinates (and the whole HTML is just way too many tokens). How do you guys handle that?
In case anyone else was looking for the functions available to the LLM: https://github.com/gregpr07/browser-use/blob/68a3227c8bc97fe...
You can just extend this, e.g. by adding data to a database, sending notifications, extracting a specific data format, etc. Make sure to also accept your added function when it's called in act(): https://github.com/gregpr07/browser-use/blob/main/src/contro...
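In the abstract, the pattern being described is an action registry plus a dispatcher. A generic sketch of that idea only (this is not browser-use's actual API; see the linked controller file for the real thing):

from typing import Callable

ACTIONS: dict[str, Callable[..., str]] = {}

def register_action(name: str):
    # Decorator that makes a custom function callable by the agent.
    def wrap(fn: Callable[..., str]) -> Callable[..., str]:
        ACTIONS[name] = fn
        return fn
    return wrap

@register_action("save_to_db")
def save_to_db(record: dict) -> str:
    # ... insert into your database here ...
    return f"saved {record!r}"

def act(name: str, **kwargs) -> str:
    # Dispatch whatever action the LLM chose; custom actions must be accepted here.
    if name not in ACTIONS:
        return f"unknown action: {name}"
    return ACTIONS[name](**kwargs)

# act("save_to_db", record={"title": "Show HN", "points": 128})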
Yes, this plus reasoning and asking the user for additional info. More here: https://github.com/gregpr07/browser-use/blob/main/src/agent/...
This looks really interesting. The first hurdle, though, that prevents me from experimenting with this on my job is the lack of a license.
I see in the readme that it claims that it is MIT licensed, but there is no actual license file or information in any of the source files that I could find.
Thank you, love the feedback! Will add the license.
Let me know how it goes if you try it.
I was really excited about the original Claude computer use until I watched the YouTube videos and saw it was only running in a Docker container. I wish I could run something like this on a real machine.
What makes a Docker container not a "real machine"? Docker programs run natively, just like any other program, without emulation/virtualization. It's not like a virtual machine (unless you're on one of the lesser operating systems like Windows or OSX); it's just configuring some settings in the Linux kernel to isolate the process from other processes. It's basically just an enhanced chroot.
Running natively doesn't make it a real machine. If I run iOS Simulator, it also runs natively, but I'm pretty sure it's not a real iPhone ;)
You can run browser-use in your terminal, no need for Docker containers. Just clone it and run it.
Does it work with COM objects/Java applications?
I'd give my interest in Hell for a way to have a script plug in data into a Java app.
No, I think that's a completely different beast; this is only for HTML/websites.
But it would be interesting to see what happens with our pipeline with a pure vision model. Did you mean something else?
have you tried Claude's "computer use"?
What this wants is cron, so I can ask it to check with my local parking agency every day or every 12 hours whether I have a parking ticket, and to raise a warning if I do. Or to check with the county jail and see if someone is still there/not there. Or check the price of a product on Amazon every hour and warn when it's changed (aka camelcamelcamel but local). Search Craigslist/Zillow/Facebook Marketplace for items until one shows up. Etc.