I'm glad this worked for Simon, but I would probably prefer using a userscript that watches for DOM text changes and streams them to a small local web server, which appends the URL, text change, and timestamp to a JSONL file. I already have something doing this; it lets me back up what I'm looking at in real time, like streaming LLM generations, and it relies only on normal browser technology. I should probably share my code, since it's quite useful.

I'm a bit uncomfortable relying on an LLM to transcribe something when there is a stream of text that could be captured in a robust way, with real data, vs. well-trained but indirect token magic. A middle ground might be grounded extraction with evidence chains: timestamps, screenshots, the cropped regions it's sourcing from, and spelled-out reasoning.

There's the extraction/retrieval step, and then there's a kind of data normalisation. Of course, it's nice that he's got something that just works in two or three steps, and it's good that the technology is getting quite reliable and cheap a lot of the time, but still, we could do better.
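The browser half of that idea would be a userscript built on a `MutationObserver` that POSTs each text change to localhost. The receiving half can be tiny. Here is a rough sketch of the local logger in Python, using only the standard library; the port, endpoint shape, and file name are my own assumptions, not the commenter's actual code:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

LOG_PATH = "dom-changes.jsonl"  # hypothetical output file


def append_event(path, url, text, ts=None):
    """Append one observed text change as a JSON line: {url, text, ts}."""
    record = {"url": url, "text": text, "ts": ts if ts is not None else time.time()}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


class LogHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The userscript would POST a JSON body like {"url": ..., "text": ...}.
        body = self.rfile.read(int(self.headers["Content-Length"]))
        event = json.loads(body)
        append_event(LOG_PATH, event["url"], event["text"])
        self.send_response(204)  # no content needed back in the page
        self.end_headers()


# To run the logger locally (blocks forever):
#   HTTPServer(("127.0.0.1", 8123), LogHandler).serve_forever()
```

Because every event is an independent JSON line, the log survives crashes mid-stream and can be replayed or grepped later, which is what makes JSONL a good fit for capturing streaming generations.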
I did something a bit like that recently to scrape tweets out of Twitter: https://til.simonwillison.net/twitter/collecting-replies
The userscript idea is great, I could think of some uses for this, such as text-to-speech for live comments. Do you know of any examples of projects already doing this?
Being Google, isn't it highly likely that the price is a loss leader which will later be raised once customers are sufficiently locked in? I get that this is more convenient than doing it programmatically or manually, but that seems like a reason to use something other than Gmail. This approach just seems incredibly wasteful to me.
Couldn't he have sent it as a fax and then photographed the fax?
Video scraping doesn’t need to be just screen captures. I’ve demoed a solution with Gemini where you take a video walking up and down the aisles of a retail store, and it captured 100% accurate data on product name, quantity/size, SKU, and price for a little under 75% of the products. And that was back in January.
This has huge implications for everything from competitive pricing, to understanding store layouts, to creating your own grocery store inflation monitor. Just subtly take a video and process it.
And the models have only gotten better.
> This has huge implications for everything from competitive pricing, to understanding store layouts
Even smaller stores have been monitoring their competitors for a long time.
> your own grocery store inflation monitor
You could also check your itemized bill.
Still amazed that video is so "cheap" on tokens despite being way more bytes than text
Pretty sure there's some strong preprocessing being applied to that video, though. Maybe even to the point of extracting text and deduplicating it between frames.
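That hypothesis is easy to sketch: if each frame were OCR'd independently, adjacent frames would mostly repeat each other, and a single pass can collapse the repeats. This is purely my illustration of the idea, not anything documented about Gemini's actual pipeline:

```python
def dedupe_frame_text(frames):
    """frames: list of per-frame OCR text lines, in video order.
    Returns lines in first-seen order, dropping repeats that come
    from overlapping frames."""
    seen = set()
    out = []
    for lines in frames:
        for line in lines:
            key = line.strip().lower()  # normalize before comparing
            if key and key not in seen:
                seen.add(key)
                out.append(line)
    return out


frames = [
    ["Oats 1kg $3.50", "Milk 2L $2.20"],
    ["Milk 2L $2.20", "Bread $1.80"],  # overlaps the previous frame
]
# dedupe_frame_text(frames) keeps each distinct line exactly once
```

A real pipeline would need fuzzier matching (OCR noise means repeated lines rarely match exactly), but even this naive version shows why deduplicated text could be far cheaper in tokens than raw frames.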
You've got me thinking. Would this work for real estate data? A lot of sites make it quite hard to grab their raw data. Also, perhaps it could gain some insights from the photos...
I'm certain it would. That would be a really fun experiment to run!
Could also work for social media, which can be hard to scrape.
I think this sort of thing is what Microsoft intended with Recall. The problem is the privacy implications are horrible.
Something I really like about this technique is that I stay in complete control of what I expose to the model. If I don't want something fed into the model I omit it from the screen recording.
I admit this is a pretty cool technique, but what is missing is how accurate the data extraction was. Without knowing that it is not possible to judge how useful this technique is.
I watched the 35 second long video and confirmed by eyeballing the JSON that the result was exactly correct.
> You should never trust these things not to make mistakes, so I re-watched the 35 second video and manually checked the numbers. It got everything right.
He said in his tweet that he verified the results.
This is gonna push things towards some very unfortunate DRM.
Can't stop the webcam->LLM
Poison pixels and other cat-and-mouse things will definitely happen.