I recently experimented with Apple's Foundation Models framework, and I came away impressed at the speed and accuracy of the LLM. You can't ask it to build you a web app, but it can reliably translate a written instruction into tool use within your native app. I think there's a lot of merit to Apple's approach, using specialist tiny models like Ferret-UI Lite, though I don't think we'll see the full fruits of their labor for another year or two.
But it's a vision I can get behind, where basic tasks like transcription, computer use, in-app tool use, image understanding, etc., are local, secure, and private.
I'm disappointed that they are taking the long way around, with screen shots and visual recognition.
Apple GUIs have underlying accessibility annotations that, if surfaced, would make UI manipulation easy for LLMs.
"Back in the day" - 1990's - Apple had Virtual User, basically a lisp derivative that reported UI state as S-expressions (like a web DOM) and allowed scripts to manipulate settings and perform UI actions.
With such a curated DOM/model and selective UI inputs, they could manage privacy and safety, opening up LLM control to users who would otherwise never trust a machine.
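To make the idea concrete, here's a toy sketch of what "UI state as S-expressions" could look like. All names here are invented for illustration; this has nothing to do with Virtual User's actual syntax or any Apple API:

```python
# Toy sketch: serialize a widget tree (here, plain dicts) into an
# S-expression an LLM or script could read and act on. Invented names;
# not Virtual User's real format.

def to_sexpr(node):
    """Render a widget dict as an S-expression string."""
    attrs = " ".join(f':{k} "{v}"' for k, v in node.get("attrs", {}).items())
    children = " ".join(to_sexpr(c) for c in node.get("children", []))
    parts = [node["role"]]
    if attrs:
        parts.append(attrs)
    if children:
        parts.append(children)
    return "(" + " ".join(parts) + ")"

window = {
    "role": "window",
    "attrs": {"title": "Settings"},
    "children": [
        {"role": "checkbox", "attrs": {"label": "Wi-Fi", "state": "on"}},
        {"role": "button", "attrs": {"label": "Done"}},
    ],
}

print(to_sexpr(window))
# (window :title "Settings" (checkbox :label "Wi-Fi" :state "on") (button :label "Done"))
```

The point is that a tree like this is tiny, structured, and queryable - a model can find the "Wi-Fi" checkbox by label instead of hunting for pixels in a screenshot.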
I hope they're working on that approach and training models for it. It's one way they could distinguish the Apple platform as being more controllable, with safety and permissions built into the subsystems instead of giving the LLM full control over UI input.
> I'm disappointed that they are taking the long way around, with screen shots and visual recognition.
This strikes me as more of a universal fallback vs. Apple choosing vision instead of a structured control plane. It nicely complements the layers Apple has been building for years: App Intents, Shortcuts, Spotlight/Siri surfaces, etc. Those are essentially curated action graphs with explicit parameters, validation, and user consent, which is much closer to your "DOM with safety rails" ideal.
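A "curated action graph with explicit parameters, validation, and user consent" can be sketched in a few lines. This is a hypothetical toy, not Apple's App Intents API - every name below is made up:

```python
# Hypothetical sketch of a curated action registry: the model may only
# invoke registered actions, parameters are validated against an explicit
# schema, and sensitive actions are gated on user consent.

REGISTRY = {}

def register(name, params, needs_consent=False):
    def wrap(fn):
        REGISTRY[name] = {"fn": fn, "params": params, "consent": needs_consent}
        return fn
    return wrap

@register("set_brightness", params={"level": float})
def set_brightness(level):
    return f"brightness -> {level:.0%}"

@register("send_message", params={"to": str, "body": str}, needs_consent=True)
def send_message(to, body):
    return f"sent to {to}"

def invoke(name, args, user_consented=False):
    entry = REGISTRY.get(name)
    if entry is None:
        raise KeyError(f"unknown action: {name}")
    for key, typ in entry["params"].items():  # explicit parameter schema
        if key not in args or not isinstance(args[key], typ):
            raise ValueError(f"bad parameter: {key}")
    if entry["consent"] and not user_consented:  # consent gate
        return "blocked: needs user approval"
    return entry["fn"](**args)

print(invoke("set_brightness", {"level": 0.5}))             # brightness -> 50%
print(invoke("send_message", {"to": "Ada", "body": "hi"}))  # blocked: needs user approval
```

The contrast with raw UI input is the whole point: the model never gets a mouse and keyboard, only a vetted vocabulary of actions the platform can audit and the user can approve.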
All iOS app developers should now be building "App Intents first". Vision-based awareness is a nice safety net for users of apps whose devs haven't yet realized where this is all obviously going.
I strongly agree that accessibility/programmatic UI control is the way.
But also: app builders are never going to get in line. UIs will incessantly produce novel spins. And widgets.
Yes, the system should demand those have good DOM-like expressions, that they be well-behaved components.
But I also feel that using vision processing is a pretty direct way to work around the problem, and while I wish we could build that better, more orderly world, I think there's something practical and real here.
direct to paper, https://arxiv.org/pdf/2509.26539
I'd be very interested to learn about output quality vs. token utilization for both of these approaches.