Would you mind sharing the story behind your pivot from on-demand photorealistic vision datasets[0] to browser automation?
[0] https://www.ycombinator.com/launches/Lbx-simplex-on-demand-p...
Yeah sure! We mentioned briefly in the post that we stumbled upon this problem while working on a synthetic data contract. At the time, we were building internal software to fully automate synthetic data creation, all the way from purchasing assets to building scene layouts to rendering images. After 2 months of building, we realized that there wasn't much demand for it, and that we couldn't build something minimal but feature-complete that we could give to people and iterate on. We also saw the writing on the wall re: building 3D world models (see WorldLabs and NVIDIA's new 3D world models). We're excited to see where this goes!
It's quite a crowded market, browser workflow automation. What are you guys trying to do differently? For me, as someone who does a bunch of browser automation tasks, the real issue is stability. So many tools fall over on non-trivial workflows like complex forms with tabs. Fix that and you may open up a large testing automation market.
Some of the things we're trying to do differently are 1) making flows deterministic by using vision models that give a consistent output for a given image and input query and 2) breaking flows up into smaller atomic actions to improve consistency as well.
RE: workflows w/ complex forms and tabs -- do you have some sites that are good examples of this? We'd love to see how Simplex does.
VC funding.
Apparently a simplex is a concept in geometry. I'm guessing this is the intended meaning of the term in the name. When most decently well informed, native speakers of English hear "simplex," though, they are likely to hear "herpes simplex" and so are likely to wonder, as I did, whether, or why, the program is named after a sexually transmitted disease.
I thought it was related to the simplex method in linear programming.
Which IIRC is named for that geometry thingie GP mentioned, that this method uses to represent the problem, with the solutions being vertices/edges of the thingie.
Interesting, TIL herpes is top of mind for me, but not for most of HN.
If you gave me the chance to come up with 500 things related to simplex, herpes would not be there.
I would never have made that link.
I for one did not make that connection. Only thing I thought of was https://simplex.chat/ (which is fortunately not marketed as a dating app)
Great demo, straight to the point. It might be nice to have some kind of feedback mechanism if it can't find the element on the page, or if it's partially cut off. For example, I changed the GitHub profile to my own (https://github.com/spro) in the example and it doesn't scroll down far enough for the whole image. I imagine in general it would be nice to scroll to an element ID (or even element description using the vision models) instead of a hardcoded value.
Side-note: The comment for the frequency graph is wrong, it mentions stars instead.
RE: feedback mechanism -- yep, a feedback mechanism is definitely something we're thinking of adding. Since we use VLMs that are trained to always output coordinates (i.e., they don't have a way to say "not on the page"), we're probably going to try fine tuning with some negative examples to see if we can build that feedback mechanism in.
One way to hack the scrolling to an element is to first run extract_bbox on a natural language description (in your case for GitHub it might be "follow button") then take the Y coordinate of that element and scroll that number of pixels. I just wrote this bit of code that I tested and it brings the contribution graph into full view:
But then it incorrectly picks the code review/submissions/etc. graph as the green tile graph -- we'll look into it!
RE: frequency graph typo -- just pushed a fix, thanks!
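For readers following along, the scrolling workaround described above (look up a bounding box from a description, then scroll by its Y coordinate) could be sketched roughly as follows. This is not the snippet from the thread. Only the name extract_bbox comes from the discussion; the scroll method, the (x, y, width, height) box format in page pixels, the margin parameter, and the FakeClient stand-in are all assumptions made for illustration.

```python
from typing import Protocol, Sequence


class BrowserClient(Protocol):
    """Hypothetical interface; only the name extract_bbox appears in the thread."""

    def extract_bbox(self, description: str) -> Sequence[float]: ...

    def scroll(self, pixels: float) -> None: ...


def scroll_to_description(client: BrowserClient, description: str, margin: float = 0.0) -> float:
    """Scroll so the element matching `description` sits near the top of the view.

    Assumes the bounding box is (x, y, width, height) in page pixels.
    Returns the number of pixels scrolled.
    """
    _x, y, _width, _height = client.extract_bbox(description)
    # Leave an optional margin above the element, but never scroll a negative amount.
    offset = max(y - margin, 0.0)
    client.scroll(offset)
    return offset


class FakeClient:
    """Stand-in that records scroll calls, so the helper can run without a browser."""

    def __init__(self, bbox):
        self.bbox = bbox
        self.scrolled = []

    def extract_bbox(self, description):
        return self.bbox

    def scroll(self, pixels):
        self.scrolled.append(pixels)
```

With a real client the helper would be called the same way, e.g. scroll_to_description(client, "contribution graph", margin=20), provided the client exposes methods matching the assumed interface.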
tried:
get error:
So sorry about this, it looks like our servers went down overnight. Just fixed, should be good now!
Looks pretty cool. How do you distinguish Simplex from Skyvern or UI.Vision?
Hey thanks! We think we're pretty different from Skyvern as Skyvern provides a full agent loop for users + is no code. We wanted to build something at a lower level (i.e., no high-level planner that chooses what tasks to do for you) and we wanted to be able to program our own logic alongside the intelligence part because using code is what's natural for us as developers.
I hadn't heard of UI Vision but just took a look at it -- it also looks like a no-code solution that's a Chrome extension, so I'd say the main differences are the same as the differences w/ Skyvern -- we're lower level and meant to be used by developers.
I'd add that we're also able to directly extract parts of websites that have no official API -- for example, an image of the GitHub contribution graph like I show in the video demo.
Very cool, any chance of either open-sourcing it or allowing the browser part to be self-hosted? i.e. to act on websites hosted in a local lan/vpn?
Also, did you evaluate https://github.com/browser-use/browser-use by any chance and have any comments about it? I'm assuming it was too AI-heavy based on what you said about claude/etc?
Thanks John! We should make that more clear in the docs. You can set browser=None on initialization and Simplex will create a local browser instance that can run on your local websites. We're not planning to be open source right now since a large portion of our product is custom vision models + inference speedups through hosting.
Browser Use is another YC company. Probably the biggest difference is that they're more agent focused while we're lower level -- in the Claude Computer Use camp like you mentioned.
VLMs are great - I have been able to use them for a similar project too [1]. And they're only going to get better. Congratulations on the product launch! What's your VLM for this?
[1] A framework to use/control mobile phones via any LLM - https://github.com/BandarLabs/clickclickclick
We fine-tune our own VLMs for this -- unfortunately we'd prefer not to share which ones we use specifically! ClickClickClick looks awesome. Have you heard of FerretUI (https://arxiv.org/pdf/2404.05719)? Pretty similar idea.
Yes, I tried a similar one called "omniparser" - the issue was that it sometimes failed to annotate some UI elements. Moreover, Gemini and Molmo worked right out of the box without needing any fine-tuning.
I'm surprised you named your framework clickclickclick instead of taptaptap.
I've tried the following code:
It outputs all the text from the page. Is that expected? Maybe it should fail, indicating that the element could not be found?
extract_text returns all the lines of text it finds within an element. Looks like for your case it selected a majority of the page from the description, so it's returning most of the text on the page.
We currently don't return failure cases (just the closest match) -- but good suggestion! We'll fine-tune on some negative cases and see if we can catch them.
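Until failure cases are reported, one client-side heuristic for spotting a likely miss is to check how much of the viewport the matched box covers, since a "closest match" that spans most of the page is suspicious. The sketch below is only an illustration and not a Simplex feature: the (x, y, width, height) box format, the viewport dimensions, and the 0.8 threshold are all assumptions.

```python
def bbox_area_fraction(bbox, viewport_width, viewport_height):
    """Return the fraction of the viewport covered by bbox, clipped to the viewport.

    bbox is assumed to be (x, y, width, height) in viewport pixels.
    """
    x, y, w, h = bbox
    # Clip the box to the viewport so off-screen parts are not counted.
    left, top = max(x, 0), max(y, 0)
    right, bottom = min(x + w, viewport_width), min(y + h, viewport_height)
    visible_w = max(right - left, 0)
    visible_h = max(bottom - top, 0)
    return (visible_w * visible_h) / (viewport_width * viewport_height)


def looks_like_miss(bbox, viewport_width, viewport_height, max_fraction=0.8):
    """Flag a match as suspicious when it covers more than max_fraction of the viewport."""
    return bbox_area_fraction(bbox, viewport_width, viewport_height) > max_fraction
```

The threshold is arbitrary and would need tuning per site; a legitimately large element, such as a full-page article body, would also trip it.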
Nice job! Does this only work with Python? In your docs under API reference, I only see 1 endpoint '/find-element'. If I wanted to use this via REST API without python, is that possible? Also, what kind of pricing can be expected?
Thanks! We're happy to add more REST API functionality -- could you email us at founders@simplex.sh if you're up to talk further about what you'd like to see? RE: pricing, we're still figuring this out!
I have to say, this looks both simple and 20 times better than those horrible no-code solutions with boxes and arrows.
I think you are onto something here.
Thanks, we hope so! That's an interesting conversation -- I'd argue that the graph-based no-code solutions you're referring to are for a different set of people than those commonly found on HN. As a developer I didn't particularly want to work with those tools, so we built this to supercharge our own code-based flows instead. I actually don't think the node-based flows are horrible at all since they successfully enable non-developers/less technical people to use agents/build easy automations.
Nice.
Does it work with sites that have cloudflare antibot (or similar) functions?
Can this use a cloud browser API like browserless?
Yep, any websocket URL works. I see that Browserless offers a websocket URL, so you can use them! You just need to pass in a Playwright browser object into the Simplex constructor.
it fails for this query search("amazon.in", "fitness watch")
Ah, we've whitelisted some websites to prevent abuse -- amazon.in wasn't one of them. I just whitelisted amazon.in for you and tried the search query -- looks like it works!
+1 for considering how it could have been abused to visit blocked sites and circumvent online protections.
The syntax is also intuitive. Site loading seems pretty slow, though maybe that's just the playground and the paid access is much faster.
Yep, we also run all our python code in sandboxes and a few other security measures!
Site loading is pretty slow right now due to a combination of 1. the traffic we're currently getting, 2. running a remote session, 3. running a few large vision language models, and 4. adding waits to allow pages to load/allow you to view your search results. We're working on cutting our latency significantly since it leads to a better development experience.