- i saw your other comment that talks about using an open source dataset but i had to ask
- how would you actually go about loading reviews if you really wanted to
- what kind of system would you need to work around the captcha and stuff
i would probably use Playwright with custom code, create chunks based on similar products, then run it on a large cluster in parallel (https://github.com/Burla-Cloud/burla).
if you have a single worker trying to scrape a shit ton of products back to back to back you're going to get rate limited or their bot detection will catch you.
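a rough sketch of the chunk-then-fan-out idea, stdlib only. everything here is hypothetical: the `category` field, the function names, and the `fetch` callable, which stands in for whatever Playwright code actually loads a product page. the thread pool is just a local stand-in for a cluster's parallel map, so treat this as an illustration of the shape and not how burla itself does it.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import time


def chunk_by_category(products, chunk_size):
    """Group products by a (hypothetical) 'category' key, then split each group into chunks."""
    groups = defaultdict(list)
    for product in products:
        groups[product["category"]].append(product)
    chunks = []
    for items in groups.values():
        for start in range(0, len(items), chunk_size):
            chunks.append(items[start:start + chunk_size])
    return chunks


def scrape_chunk(chunk, fetch, delay_s=0.0):
    """Run fetch on each product in one chunk, pausing between requests."""
    results = []
    for product in chunk:
        results.append(fetch(product))  # fetch is a placeholder for the real page-loading code
        time.sleep(delay_s)             # spacing requests out keeps a single worker's rate low
    return results


def scrape_all(products, fetch, chunk_size=50, workers=8, delay_s=0.0):
    """Fan the chunks out across workers and flatten the per-chunk results."""
    chunks = chunk_by_category(products, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(lambda c: scrape_chunk(c, fetch, delay_s), chunks)
        return [item for chunk_results in per_chunk for item in chunk_results]
```

on a real cluster each chunk would go to a separate machine instead of a thread, so no single worker is hammering the site back to back.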
Well the website is kind of useless, but it does suck me in. I love reading crazy reviews. The only thing that would make it better is if they also included Airbnb reviews.
The second review I read was a customer complaining about profanity in a movie and then writing out all the examples. Who has time for that?
well well well... take a look at what I just built https://burla-cloud.github.io/airbnb-burla/
I must say the reviews you have are more in the horrifying and less in the pretty funny situation. My favorite funny (and bad) review was a host that accused his guest of flipping over all the furniture in the house and the guest was like "why and how would I do this". I still want to know what happened that day. How did all the furniture end up upside down?
I love it! Endless entertainment and 0 attempts to get me to stay at the Airbnb.
yeah now that I have the images I want to do some silly shit with it. maybe find all the Airbnbs with satanic decor or like red rooms haha
I love this. The reviews' wordplay tops Macbeth in my book.
i'm just happy they don't censor the comment section haha, makes for funny content.
i also love that people will complain about the vulgar language in a book or movie by writing a review that contains a quote with the vulgar language
how did you scrape all the reviews?
open source dataset from McAuley Lab at UCSD https://huggingface.co/datasets/McAuley-Lab/Amazon-Reviews-2....
I'm going to publish an Airbnb example tomorrow where I scraped 1,406,718 photo URLs from public listing pages. For that I used https://docs.burla.dev/, a high-performance parallel processing Python library I've been working on for a few years now.
Shit like this is why Amazon reviews are now behind a login wall for everyone.