OmniParser for Pure Vision Based GUI Agent

(microsoft.github.io)

70 points | by fzliu 6 hours ago

9 comments

  • Smaug123 4 hours ago

    To a considerable extent, we are stuck in the world we live in; but I am reminded of a quote by Guillaume Allais:

    > My entire job seems to be repeating variations of "never start by forgetting the user's stated intent only to then attempt to guess it".

  • trq_ 5 hours ago

    This is awesome, can't wait for evals against Claude Computer Use!

    • amelius 2 hours ago

      Can we first test this with basic sysadmin work in a simple shell?

      Can't wait to replace "apt get install" by "gpt get install" and then have it solve all the dependency errors by itself.

  • s3tt3mbr1n1 2 hours ago

    Has anyone gotten this to work?

    Cloning the repo and downloading the models, whether through HuggingFace or manually, does not seem to work; you get errors indicating missing files.

  • amelius 2 hours ago

    Can it detect ads and mask them out?

  • akshayKMR 2 hours ago

    Does it also tell the coordinates (x,y) of the annotated box w.r.t. the screenshot dimensions?

  • jauntywundrkind 5 hours ago

    I have a little bit of a vice of enjoying some "idle" games. I have intended to do some very basic manual screen carving & ocr & computer vision to try to "read" my state in these games, & have multi-actor "play" models for them, just for fun really & to decrease time sunk gaming (by spending significant time coding/learning).

    This certainly seems like it has a lot of promise to make that much much much easier. Game UIs are less uniform, so this might be harder or not easily applicable, but hopefully

    • _adamb 3 hours ago

      As someone who has done this to many games over a few decades, I can definitively say: 100% of the time, it ruins the fun of the game.

      I can't say exactly why. Maybe you feel like you haven't earned it. Maybe it's the idle nature of farming that we really enjoy...

      • fragmede 2 hours ago

        Depends what you consider fun, and how far you take it. Some people enjoy programming more than repetitive clicking in a GUI. For a clicker game, writing a bot lets you iterate on strategies more easily - is it faster to get to level 2 if I buy the upgrade for A or B first? For Trackmania, it lets you get a world record and a YouTube video with 14M views.

        https://youtu.be/Dw3BZ6O_8LY