Teaching GPT-5 to Use a Computer

(prava.co)

35 points | by Areibman 2 days ago

8 comments

This is cool, though I wanted to share a couple of thoughts for reflection:

I feel like your demo video is not the best one to highlight the capability. A browsing use case likely does require a key press -> planning loop, but a gaming use case, or well-known software (e.g., Excel), may let the model think ahead 10-20 key presses before needing the next loop / verification. The current demo makes it seem slow / prototype-like.

Also, the X/Y approach is interesting as a generic approach to screen management. For browsers, though, you're likely adding overhead relative to just marking the specific divs/buttons that are on screen and making those part of the reasoning (e.g., "Click button X at div with path XX"). It may be helpful to think about the workflows you are going after and what kind of accelerated management you have over them.
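
To make the "mark the divs/buttons" idea concrete, here is a minimal sketch in Python using only the standard library. It is not how the linked project works: the list of interactive tags, the `ElementMarker` class, and the `describe_targets` helper are all assumptions made up for illustration. It numbers the interactive elements in a page and records a simple tag path for each, so a model could answer with "click [1]" instead of pixel coordinates.

```python
from html.parser import HTMLParser

# Tags treated as interactive in this sketch (an assumption, not a standard list).
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}
# Tags that never have a closing tag, so they must not go on the path stack.
VOID_TAGS = {"input", "br", "img", "hr", "meta", "link"}


class ElementMarker(HTMLParser):
    """Collects interactive elements with a numeric mark and a tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []       # open tags from the root to the current position
        self.marks = []       # one dict per interactive element
        self._current = None  # interactive element still waiting for text

    def handle_starttag(self, tag, attrs):
        path = "/".join(self.stack + [tag])
        if tag in INTERACTIVE_TAGS:
            attr = dict(attrs)
            label = attr.get("aria-label") or attr.get("placeholder") or attr.get("value") or ""
            mark = {"id": len(self.marks) + 1, "tag": tag, "path": path, "label": label}
            self.marks.append(mark)
            if tag not in VOID_TAGS:
                self._current = mark
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag; tolerates unclosed children.
            while self.stack and self.stack.pop() != tag:
                pass
        if self._current is not None and self._current["tag"] == tag:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        # Use the first visible text as the label when no attribute supplied one.
        if self._current is not None and text and not self._current["label"]:
            self._current["label"] = text


def describe_targets(html):
    """Return one line per interactive element, suitable for a prompt."""
    parser = ElementMarker()
    parser.feed(html)
    return [f'[{m["id"]}] {m["tag"]} "{m["label"]}" at {m["path"]}' for m in parser.marks]


page = (
    "<html><body><div><button>Submit</button>"
    '<input placeholder="Email"></div>'
    '<a href="/help">Help</a></body></html>'
)
for line in describe_targets(page):
    print(line)
```

A real browser integration would read the live DOM rather than static HTML, and would need to filter out hidden or off-screen elements, which this sketch ignores.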

daxfohl 7 hours ago

Very cool. I've been thinking for a while that this is where things will end up. While custom AI integrations per service/product/whatever can be better and more efficient, there's always going to be stuff that doesn't have AI integrations but your workflow will need to use.

Without this, AI is going to be limited and kludgy. Like if I wanted to have AI run a FEA simulation on some CAD model, I have to wait until the FEA software, the CAD software, the corporate models repo, etc. all have AI integrations and then create some custom agent that glues them all together. Once AI can just control the computer effectively, then it can look up the instruction manuals for each of these pieces of software online, and then just have at it e2e like a human would. It can even ping you over Slack if it gets stuck on something.

I think once stuff like this becomes possible, custom AI integrations will become less necessary. I'm sure they'll continue to exist for special cases, but the other nice thing about a generic computer-use agent is that you can record the stream and see exactly what it's doing, which is a huge increase in observability. It can even demo to human workers how to do things because it works via the same interfaces.

kevingadd 4 hours ago

One potential virtuous cycle here is that accessibility trees used by tools like screen readers are also a nice potential way for a model to consume information about what's on screen and how it can be interacted with. So it creates an additional incentive for improving the accessibility of new and existing software, because doing that lights up integration with future models.
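
As a rough illustration of what "consuming the accessibility tree" could look like, here is a small Python sketch. The `AXNode` type and the tree below are hypothetical stand-ins, not a real platform accessibility API; only the role names (button, link, textbox, and so on) are borrowed from ARIA. The idea is to walk the tree and keep just the nodes a model could act on, with their role and accessible name.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class AXNode:
    """Hypothetical accessibility node: a role, an accessible name, children."""
    role: str
    name: str = ""
    children: List["AXNode"] = field(default_factory=list)


# Roles treated as actionable in this sketch (an assumed subset of ARIA roles).
ACTIONABLE_ROLES = {"button", "link", "textbox", "checkbox", "menuitem"}


def actionable_nodes(node: AXNode, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, role, name) for every node a model could act on."""
    if node.role in ACTIONABLE_ROLES:
        yield depth, node.role, node.name
    for child in node.children:
        yield from actionable_nodes(child, depth + 1)


def summarize(root: AXNode) -> str:
    """Render the actionable nodes as indented text for a prompt."""
    return "\n".join(
        f'{"  " * depth}{role}: {name}'
        for depth, role, name in actionable_nodes(root)
    )


tree = AXNode("window", "Save As", [
    AXNode("toolbar", "", [
        AXNode("button", "Save"),
        AXNode("button", "Open"),
    ]),
    AXNode("textbox", "File name"),
])
print(summarize(tree))
```

Software that exposes good names and roles to screen readers would produce a clean summary here for free, which is the incentive described above.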

alhirzel 40 minutes ago

This cycle starts with an integration for model developers. I wonder if anyone is working on a generic ARIA hookup, as well as whatever standards are necessary for desktop/smartphone integration?

deadbabe 3 hours ago

I imagine in the future someone will make an Agent-First OS that is built entirely from the ground up to be run by AI, on the assumption that there are no human users or that their usage is limited. That will be interesting: imagine all the things you could do differently, the design choices you could make. You lose a lot by accommodating human ergonomics.

Waterluvian 13 minutes ago

What might you imagine being different in an “agent-first OS” compared to a terminal-only Linux distribution?

yuliyp 7 hours ago

I can't help but feel like some sort of hybrid approach might work better: use GPT-5 for the strategy, then a more direct ML model for actually executing the strategy. Using reasoning directly for input control is like trying to reason your way through driving.
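
A toy version of that split, in Python, with both models replaced by stubs. Everything here is hypothetical and for illustration only: `stub_planner` stands in for the reasoning model, `stub_executor` stands in for the faster low-level model, and the step and event formats are made up. The point is only the shape of the loop: the expensive model is called once per goal, and the cheap one turns each step into input events.

```python
from typing import Callable, List, Tuple

Step = str               # high-level instruction, e.g. "click Save"
Event = Tuple[str, str]  # low-level input event, e.g. ("click", "Save")


def run_hybrid(
    goal: str,
    planner: Callable[[str], List[Step]],
    executor: Callable[[Step], List[Event]],
) -> List[Event]:
    """Ask the planner once for a strategy, then expand each step into events."""
    events: List[Event] = []
    for step in planner(goal):
        events.extend(executor(step))
    return events


def stub_planner(goal: str) -> List[Step]:
    # Stand-in for the reasoning model: returns a fixed plan regardless of goal.
    return ["focus File name", "type report.txt", "click Save"]


def stub_executor(step: Step) -> List[Event]:
    # Stand-in for the low-level model: maps one step to concrete input events.
    verb, _, arg = step.partition(" ")
    if verb == "type":
        return [("key", ch) for ch in arg]  # one key event per character
    return [(verb, arg)]


events = run_hybrid("save the file as report.txt", stub_planner, stub_executor)
print(len(events), events[0], events[-1])
```

A real system would also need the executor to report failures back to the planner so it can replan, which this sketch leaves out.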

Philpax 6 hours ago

That's what the article describes, yes.