I run a small startup called SEOJuice, where I need to crawl a lot of pages all the time, and I can say that the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar, just to get access to any website. Bandwidth and storage are the smallest cost factors.
Even though, in my case, users add their own domains, it still took me quite a bit of time to reach a 99% success rate crawling a website — with a mix of residential proxies, captcha solvers, rotating user-agents, and stealth Chrome binaries; otherwise I would get a 403 immediately with no HTML served.
Blocking seems really popular. I wonder if it coincides with Stack Overflow closing.
Can't your users just whitelist your IPs?
I'm in a similar boat and getting customers to whitelist IPs is always a big ask. In the best case they call their "tech guy", in the worst case it's a department far away and it has to go through 3 layers of reviews for someone to adapt some Cloudflare / Akamai rules.
And then you'd better make sure your IP is stable and that your cloud provider won't change IP assignments down the line, or you'll have to go back to every client with the same ask.
They're mostly non-technical/marketing people, but yes that would be a solution. I try to solve the issue "behind the scenes" so for them it "just works", but that means building all of these extra measures.
Would it make sense to offer the more technically minded a discount if they set up an IP whitelist, with a tutorial you could provide? A discount in exchange for reduced costs on your side?
> the biggest issue with crawling is the blocking part and how much you need to invest to circumvent Cloudflare and similar … mix of residential proxies, captcha solvers, rotating user-agents, stealth chrome binaries
I would like to register my hatred and contempt for what you do. I sincerely hope you suffer drastic consequences for your antisocial behavior.
Please elaborate: why exactly is it antisocial? Because Cloudflare decides who can or can't access a user's website, when that user specifically signed up for my service?
It intentionally circumvents the explicit desires of those who own the websites being exploited. It is nonconsensual. It says “fuck you, yes” to a clearly-communicated “please no”.
OP literally said that users add their domains, meaning they are explicitly ASKING OP to scrape their websites.
Users sign up for my service.
You employ residential proxies. As such, you enable and exploit the ongoing destruction of the Internet commons. Enjoy the money!
This is kind of like getting upset with people who go to ATMs because drug dealers transact in cash lol.
Cloudflare and Big Tech are primary contributors to the impairment and decline of the Internet commons for moats, control, and profit; you are upset at the wrong parties.
Just stop scraping. I'll do everything to block you.
> in my case, users add their own domains
Seems like they're only scraping websites their clients specifically ask them to
Now you've gamified it :)
It's a pretty easy game to win as the blocker. If you receive too many 404s against pages that don't exist, just ban the IP for a month. Actually got the idea from a hackernews comment too. Also thinking that if you crawl too many pages you should get banned as well.
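The 404-strike ban described above can be sketched in a few lines. This is a minimal in-memory version; the names and thresholds are invented for illustration, and a real deployment would want persistence and strike expiry:

```python
import time
from collections import defaultdict

# Hypothetical thresholds -- tune for your own traffic.
MAX_404S = 50                  # strikes before a ban
BAN_SECONDS = 30 * 24 * 3600   # one month

_strikes = defaultdict(int)    # ip -> 404 count
_banned_until = {}             # ip -> unix time the ban lifts

def is_banned(ip, now=None):
    now = now if now is not None else time.time()
    until = _banned_until.get(ip)
    return until is not None and now < until

def record_404(ip, now=None):
    """Call this whenever a request for a nonexistent page comes in."""
    now = now if now is not None else time.time()
    _strikes[ip] += 1
    if _strikes[ip] >= MAX_404S:
        _banned_until[ip] = now + BAN_SECONDS
        _strikes.pop(ip, None)  # reset the counter once banned
```

Wiring this into middleware means checking `is_banned` before serving, and calling `record_404` from the not-found handler.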
There's no point in playing tug of war against unethical actors, just ban them and be done with it.
I don't think it's an uncommon opinion to behave this way either, nor are the crawlers users I want to help in any capacity either.
What if the crawler is using a shared IP and you end up blocking legitimate users along with the bad actor?
The anti-bot stuff mentioned upthread is real, but at this scale per-domain politeness queuing also becomes a genuine headache. You end up needing to track crawl-delay directives per domain, rate-limit your outbound queues by host, and handle DNS TTL properly to avoid hammering a CDN edge that's mapping thousands of domains to the same IPs. Most crawlers that work fine at 100M pages break somewhere in that machinery at 1B+.
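A minimal sketch of the per-host politeness machinery described above, using a heap of "next allowed fetch time" per host. The class name, API, and default delay are invented for illustration; a real crawler would also need robots.txt parsing and DNS-aware host grouping:

```python
import heapq
import time
from collections import defaultdict, deque

class PolitenessScheduler:
    """Queue URLs per host and only release a host's next URL
    once that host's crawl delay has elapsed."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.delays = {}                  # host -> crawl-delay seconds
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready = []                   # heap of (next_allowed_time, host)

    def set_delay(self, host, seconds):
        """Record a crawl-delay directive for a host."""
        self.delays[host] = seconds

    def add(self, host, url):
        if not self.queues[host]:
            # First URL for this host: it may be fetched immediately.
            heapq.heappush(self.ready, (time.monotonic(), host))
        self.queues[host].append(url)

    def next_url(self):
        """Return (host, url) for a host whose delay has elapsed, else None."""
        now = time.monotonic()
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        if self.queues[host]:
            delay = self.delays.get(host, self.default_delay)
            heapq.heappush(self.ready, (now + delay, host))
        return host, url
```

Worker loops call `next_url()` and back off briefly when it returns `None`; the heap keeps the hot path O(log hosts) regardless of queue depth.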
> spinning disks have been replaced by NVMe solid state drives with near-RAM I/O bandwidth
Am I missing something here? Even Optane is an order of magnitude slower than RAM.
Yes, under ideal conditions, SSDs can have very fast linear reads, but IOPS / latency have barely improved in recent years. And that's what really makes a difference.
Of course, compared to spinning disks, they are much faster, but the comparison to RAM seems wrong.
In fact, for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU, so VRAM needs to be used. That's how latency-sensitive some applications have become.
>for applications like AI, even using system RAM is often considered too slow, simply because of the distance to the GPU
That's not why. It's because RAM has a narrower bus than VRAM. If it was a matter of distance it'd just have greater latency, but that would still give you tons of bandwidth to play with.
You could be charitable and say the bus is narrow because it has to travel a long distance and this makes it hard to have a lot of traces.
It's not. It's narrow even between the CPU and RAM. That's just the way x86 is designed. Nvidia and AMD by contrast have the luxury of being able to rearchitect their single-board computers each generation as long as they honor the PCIe interface.
It is also true that having a 384-bit memory bus shared with the video card would necessitate a redesigned PCIe slot as well as an outrageous number of traces on the motherboard, though.
Threadripper has 8 memory channels versus 2 for a desktop AMD CPU. It's not an x86 limitation.
"x86" as in the computer architecture, not the ISA. Why do you think they put extra channels instead of just having a single 512-bit bus?
Nice work, but I feel like AWS isn't required for this. There are small hosting companies with specialized servers (50 Gbit shared medium for under $10); you could probably do this for under $100 with some optimization.
I did some crawling on Hetzner back in the day. They monitor traffic and make sure you don't automate retrieval of publicly available data. They send you an email telling you they're concerned because you got the IP blacklisted. Funny thing is: they own the blacklist they refer to.
This. AWS is like a cash furnace, only really usable for VC backed efforts with more money than sense.
> because redis began to hit 120 ops/sec and I’d read that any more would cause issues
Suspicious. I don’t think I’ve ever read anything that says redis taps out below tens of thousands of ops…
Well the most important part seems to be glossed over and that’s the IP addresses. Many websites simply block /want to block anything that’s not google and is not a “real user”.
When I read this, I realize how small Google makes the Internet.
I was able to get 35k req/sec on a single node with Rust (custom HTTP stack, custom HTML parser, custom queue, custom KV database) with obsessive optimization. It's possible to scrape a Bing-sized index (say 100B docs) each month with only 10 nodes, for under $15k.
Thought about making it public but probably no one would use it.
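As a sanity check on the claim above (assuming the stated 35k req/s per node is sustained around the clock):

```python
# Monthly throughput at the claimed per-node rate.
req_per_sec_per_node = 35_000
nodes = 10
seconds_per_month = 30 * 24 * 3600   # 2,592,000

pages_per_month = req_per_sec_per_node * nodes * seconds_per_month
print(f"{pages_per_month:,}")        # 907,200,000,000
```

That is roughly 9x a 100B-document index per month, so the claim holds with a large margin even at far less than full duty cycle.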
please do
Yes! Please do!
There was a time when being able to do this meant you were on the path to becoming a (m)(b)illionaire. Still is, I think.
> I also truncated page content to 250KB before passing it to the parser.
WTF did I just read?