24 comments

  • mritchie712 5 hours ago

    I've been looking for this exact thing. A couple questions:

    Are your agents good at testing other agents? e.g. I want your agent to ask our agent a few questions and complete a few UI interactions with the results.

    How do you handle testing onboarding flows? e.g. I want your agent to create a new account in our app (https://www.definite.app/) and go thru the onboarding flow (e.g. add Stripe and Hubspot as integrations).

    • ttamslam 5 hours ago

      > Are your agents good at testing other agents? e.g. I want your agent to ask our agent a few questions and complete a few UI interactions with the results.

      I'd say this is one of our strong suits. Specifically, the UIs tend to be easy for browser agents to navigate, and the LLM-as-a-judge offers pretty good feedback on chat quality, which can inform later actions. (I'd be remiss not to mention, though, that a good LLM eval framework like Braintrust is probably the best first line of defense.)

      > How do you handle testing onboarding flows?

      We can step through most onboarding flows if you start from a logged-out state and give the agent the context it'll need (e.g. a Stripe test card). That said, setting up integrations that require multi-page hops is still a pain point in our system and leaves a lot to be desired.

      Would love to talk more about your specific case and see if we can help! founders@propolis.tech

  • plasma an hour ago

    Neat! How do you handle state changes during tests? For example, in a todo app the agents are (likely) working on the same account in parallel, or on a subsequent run some test data has been left behind, or the data isn't set up the way a test run expects.

    I’m curious whether you’d also move into API testing, using the same discovery/attempt approach.

    • mpapazian 44 minutes ago

      This is one of our biggest challenges, you're spot on! What we're working on to tackle this includes a memory layer that agents have access to, so state changes become part of their knowledge and are accounted for while conducting a test.

      They're also smart enough to not be frazzled by things having changed, they still have their objectives and will work to understand whether the functionality is there or not. Beauty of non-determinism!
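
      A minimal sketch of what such a memory layer could look like, purely as an illustration. `SharedMemory`, `record_created`, and `unexplained_items` are hypothetical names, not Propolis's API. The idea is that runs log the data they leave behind, so a later or parallel run can tell leftovers apart from genuinely unexpected state:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SharedMemory:
    """Hypothetical log of side effects left behind by earlier or parallel runs."""

    created: Dict[str, str] = field(default_factory=dict)  # item -> run id that created it

    def record_created(self, run_id: str, item: str) -> None:
        self.created[item] = run_id

    def record_deleted(self, item: str) -> None:
        self.created.pop(item, None)


def unexplained_items(
    observed: List[str], expected: Set[str], memory: SharedMemory
) -> List[str]:
    """Items on screen that neither this run nor any recorded run accounts for."""
    return [
        item
        for item in observed
        if item not in expected and item not in memory.created
    ]


memory = SharedMemory()
memory.record_created("run-1", "Buy milk")  # leftover from an earlier run

# run-2 added "Write report" itself and now sees three items in the todo list.
observed = ["Buy milk", "Write report", "Ghost task"]
suspicious = unexplained_items(observed, {"Write report"}, memory)
```

      Here only "Ghost task" ends up in `suspicious`: "Buy milk" is explained by the recorded leftover, so it does not fail the run.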

  • not-chatgpt an hour ago

    Been looking for a solution exactly like this, but I struggle to see how it is different from spinning up 10 Atlas tabs with a 2-sentence prompt.

    • mpapazian an hour ago

      You could definitely do that and get some good results! But if you want a repeatable process with detailed bug reports (including logs, reasoning, etc.) and a large enough search area with agents that can continuously build an understanding of your app - that's us :)

      let's chat - founders@propolis.tech

  • cloudflare728 3 hours ago

    Can it find broken UI?

    A human can find and report broken UI easily by using common sense.

    Even though it is simple for a human, a computer has no common sense, and I say that as a machine learning expert. I tried and mostly failed to build a broken-UI detector at my previous company. They had an automated plugin upgrade process that periodically broke the UI.

    I tried to detect it by taking a long screenshot: you could select an image as the working version, then later find the diff between the 2 images. It kind of worked, but not satisfactorily.
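
    A rough sketch of that baseline-diff idea, assuming the screenshots are already decoded into rows of grayscale pixel values (for example with an imaging library). The `tolerance` and `threshold` values are arbitrary assumptions for illustration, not tuned numbers:

```python
from typing import List

Image = List[List[int]]  # grayscale pixels 0-255, row-major, assumed already decoded


def diff_ratio(baseline: Image, candidate: Image, tolerance: int = 10) -> float:
    """Fraction of pixels whose brightness differs by more than `tolerance`."""
    same_shape = len(baseline) == len(candidate) and all(
        len(a) == len(b) for a, b in zip(baseline, candidate)
    )
    if not same_shape:
        # A size change (e.g. the page grew taller) counts as a full mismatch here.
        return 1.0
    total = sum(len(row) for row in baseline)
    if total == 0:
        return 0.0
    changed = sum(
        1
        for row_a, row_b in zip(baseline, candidate)
        for a, b in zip(row_a, row_b)
        if abs(a - b) > tolerance
    )
    return changed / total


def looks_broken(baseline: Image, candidate: Image, threshold: float = 0.02) -> bool:
    """Flag the candidate when more than `threshold` of its pixels changed."""
    return diff_ratio(baseline, candidate) > threshold


baseline = [[255] * 10 for _ in range(10)]  # all-white "working version"

minor = [row[:] for row in baseline]
minor[0][0] = 250  # tiny rendering noise, within tolerance

shifted = [row[:] for row in baseline]
for y in range(5):
    shifted[y] = [0] * 10  # top half of the page changed
```

    With these inputs `looks_broken(baseline, minor)` is `False` and `looks_broken(baseline, shifted)` is `True`. The weakness is also visible: any legitimate content or layout change trips the same threshold, which is probably why this kind of detector only "kind of" works.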

    • mpapazian 2 hours ago

      The agents can definitely detect when something is off, given they're using VLMs. They don't necessarily compare it to previous versions, rather they have opinionated takes on whether something looks broken / off. So - yes!

  • 8note 5 hours ago

    fraud/abuse/compliance is a good use case for this kinda thing - an abuse vector is kinda like a bug, except that the system does what it's expected to do.

    testing for abuse stuff I've always found quite difficult, since to work well, you need to both create some real resources so you can delete/clean them up, and also create a new test identity, since your abuse detection system should be deny-listing the bad actors it finds. the difficulty is that those sessions probably want to be open for like a week, so they can process both payments and refunds.

    can the agents check their email? other notification methods?

    • ttamslam 5 hours ago

      This is interesting. I think we've shied away a bit from security-ish use cases since they're outside of our personal core competencies. Do you have examples of what tools exist today for catching things like that? Or is it totally ad hoc?

      > can the agents check their email? other notification methods?

      Yes to email (for paying customers, agents spin up with unique addresses), no to other notifications, but as soon as a paying customer has a use case for SMS, etc. we'll build it.

      • dfsegoat 5 hours ago

        OTP protected flow verification
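
        For illustration, a minimal sketch of the email half of an OTP check, using only the Python standard library. It assumes the code is a 6-digit number in a plain-text, unencoded message body; `extract_otp` and the sample message are made up for this example:

```python
import re
from email import message_from_string
from typing import Optional

OTP_PATTERN = re.compile(r"\b(\d{6})\b")  # assumption: a 6-digit numeric code


def extract_otp(raw_email: str) -> Optional[str]:
    """Return the first 6-digit code in the message body, or None if absent."""
    msg = message_from_string(raw_email)
    if msg.is_multipart():
        # Only look at text/plain parts; HTML parts would need tag stripping.
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
        body = "\n".join(str(p.get_payload()) for p in parts)
    else:
        body = str(msg.get_payload())
    match = OTP_PATTERN.search(body)
    return match.group(1) if match else None


SAMPLE = (
    "From: no-reply@example.com\n"
    "To: agent-123@example.com\n"
    "Subject: Your verification code\n"
    "\n"
    "Your one-time code is 482913. It expires in 10 minutes.\n"
)

code = extract_otp(SAMPLE)
```

        The test agent would then type `code` into the verification form. Bodies sent as base64 or quoted-printable would need decoding first, which this sketch skips.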

    • dfsegoat 5 hours ago

      Really good call out re: email and other 'side-flows' - hopefully there is integration with something like Mailosaur.

      https://mailosaur.com/email-testing

  • ttamslam 6 hours ago

    Hey I'm Matt! Really excited to answer any questions.

    To elaborate a little bit on the "canary" comment --

    For a while at Airtable I was on the infra team that managed the deploy (basically click run and then sit and triage issues for a day). One of my first contributions on the team was adding a new canary analysis framework that made it easier to catch and roll back bugs automatically. Two things always bothered me about the standard canary release process:

    1) It necessarily treats some users as lower value, and thus more acceptable to risk exposing bugs to (this makes sense for things like free-tier, etc. but the more you segment out, the less representative and thus less effective your canary is). When every customer interaction matters (as is the case for so many types of businesses) this approach is harder to justify.

    2) Low frequency / high impact bugs are really difficult to catch in canary analysis. While it’s easy to write metrics that catch glaring drops/spikes, more subtle high impact regressions are much harder and often require user reports (which we did not factor in as part of our canary). Example: how do you write a canary metric that auto rolls back when an enterprise account owner (small % of overall users) logs in and a broken modal prevents them from interacting with your website?
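
    To make point 2 concrete, here is a small worked example with invented numbers (the segment names, counts, and thresholds are all assumptions, and this is not the Airtable framework). A bug that breaks every enterprise-owner session barely moves the aggregate failure rate, while a per-segment comparison catches it:

```python
from typing import Dict, List, Tuple

Counts = Dict[str, Tuple[int, int]]  # segment -> (sessions, failed sessions)


def rate(sessions: int, failures: int) -> float:
    return failures / sessions if sessions else 0.0


def overall_rate(counts: Counts) -> float:
    sessions = sum(s for s, _ in counts.values())
    failures = sum(f for _, f in counts.values())
    return rate(sessions, failures)


def aggregate_regressed(
    baseline: Counts, canary: Counts, max_increase: float = 0.01
) -> bool:
    """Classic check: did the overall failure rate rise by more than max_increase?"""
    return overall_rate(canary) - overall_rate(baseline) > max_increase


def regressed_segments(
    baseline: Counts,
    canary: Counts,
    max_increase: float = 0.01,
    min_sessions: int = 30,
) -> List[str]:
    """Per-segment check, skipping segments with too little traffic to judge."""
    flagged = []
    for segment, (sessions, failures) in canary.items():
        if sessions < min_sessions:
            continue
        base_sessions, base_failures = baseline.get(segment, (0, 0))
        if rate(sessions, failures) - rate(base_sessions, base_failures) > max_increase:
            flagged.append(segment)
    return flagged


baseline = {"free": (8000, 80), "pro": (1950, 20), "enterprise_owner": (50, 0)}
# Same traffic, but the broken modal fails every enterprise-owner session.
canary = {"free": (8000, 80), "pro": (1950, 20), "enterprise_owner": (50, 50)}

rollback_on_aggregate = aggregate_regressed(baseline, canary)
rollback_on_segments = regressed_segments(baseline, canary)
```

    The aggregate rate only moves from 1.0% to 1.5%, so `rollback_on_aggregate` is `False`, while `rollback_on_segments` flags `enterprise_owner`. The catch is `min_sessions`: on a small canary slice of real traffic, a rare segment may never see enough sessions to be judged at all.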

    I view what we’re building at Propolis as an answer to both of these things. I envision a deploy process (very soon) that lets us roll out to simulated traffic and canary on THAT before you actually hit real users (and then do a traditional staged release, etc.)

    • bfeynman 2 hours ago

      seems like you are misappropriating what canaries are useful and used for... they are designed to be lightweight and shallow... hence the name and whole analogy, canaries never were meant to determine if a mine was structurally unsafe etc

  • ekarabeg 3 hours ago

    This seems really interesting. I tried running a swarm on my landing page but didn't get a completed email. I'll try it again, though!

    • mpapazian 3 hours ago

      hi! Looking at your swarm results, you might not have given the swarm login credentials to use, which is why most of the runs are failing out. Please feel free to try it again and give them access.

  • mhb 4 hours ago

    Good video, but it looks like it plays twice. Should be ~3.5 minutes...

  • GeorgyM 3 hours ago

    Sounds interesting. Is this handling mobile as well?

    • mpapazian 3 hours ago

      We don't handle mobile yet, but we might get to it at a future date!

  • orliesaurus 2 hours ago

    Does it output playwright scripts?

    • ttamslam 2 hours ago

      We use Playwright for interacting with the browser, so while it's not available by default, we do support bulk-exporting tests as Playwright scripts, either for moving off our platform or for customers who want to run deterministic versions of the tests on their own infra (you can also run them on ours!)

  • rvz 4 hours ago

    Looks like a great idea. Does this fully automate QA testing of websites including removing the human in the loop during testing?

    Once again, great product.

    • mpapazian 4 hours ago

      Great question! The swarm takes a first pass to generate tests and can keep adding to them as it runs again over time.

      On the off chance it misses specific tests - we have tools to let you build them directly with AI support, either by giving them objectives or dropping in a video of the actions you're taking!