156 comments

  • wenbin 2 hours ago

    At Listen Notes, we rely heavily on Cloudflare to manage and protect our services, which cater to both human users and scripts/bots.

    One particularly effective strategy we've implemented is using separate subdomains for services designed for different types of traffic, allowing us to apply customized firewall and page rules to each subdomain.

    For example:

    - www.listennotes.com is dedicated to human users. E.g., https://www.listennotes.com/podcast-realtime/

    - feeds.listennotes.com is tailored for bots, providing access to RSS feeds. E.g., https://feeds.listennotes.com/listen/wenbin-fangs-podcast-pl...

    - audio.listennotes.com serves both humans and bots, handling audio URL proxies. E.g., https://audio.listennotes.com/e/p/1a0b2d081cae4d6d9889c49651...

    This subdomain-based approach enables us to fine-tune security and performance settings for each type of traffic, ensuring optimal service delivery.

  • amatecha 10 hours ago

    I get blocked from websites with some regularity, running Firefox with strict privacy settings, "resist fingerprinting" etc. on OpenBSD. They just give a 403 Forbidden with no explanation, but it's only ever on sites fronted by CloudFlare. Good times. Seems legit.

    • wakeupcall 6 hours ago

      Also running FF with strict privacy settings and several blockers. The annoyances are constantly increasing. Cloudflare, captchas, "we think you're a bot", constantly recurring cookie popups and absurd requirements are making me hate most of the websites and services I hit nowadays.

      I tried for a long time to get around it, but now when I hit a website like this I just close the tab and don't bother anymore.

      • afh1 5 hours ago

        Same, but for VPN (either corporate or personal). Reddit blocks it completely, requires you to sign in but even the sign-in page is "network restricted"; LinkedIn shows you a captcha but gives an error when submitting the result (several reports online); and overall a lot of 403s. All go magically away when turning off the VPN. Companies, especially adtechs like Reddit and LinkedIn, do NOT want you to browse privately, to the point they'd rather you not use their website at all unless it's without a condom.

        • acdha 4 hours ago

          > Companies, especially adtechs like Reddit and LinkedIn, do NOT want you to browse privately, to the point they'd rather you not use their website at all unless it's without a condom.

          That’s true in some cases, I’m sure, but also remember that most site owners deal with lots of tedious abuse. For example, some people get really annoyed about Tor being blocked, but for most sites Tor is a tiny fraction of total traffic and a fairly large percentage of the abuse: probing for vulnerabilities, guessing passwords, spamming contact forms, etc. So while I sympathize with the legitimate users, I also completely understand why a busy site operator is going to flip a switch that makes their log noise go down by a double-digit percentage.

          • rolph 13 minutes ago

            Funny thing: when FF is blocked, I can get through with Tor.

        • Adachi91 2 hours ago

          > Reddit blocks it completely, requires you to sign in but even the sign-in page is "network restricted";

          I've been creating accounts every time I need to visit Reddit now to read a thread about [insert subject]. They do not validate E-Mail, so I just use `example@example.com`, whatever random username it suggests, and `example` as a password. I've created at least a thousand accounts at this point.

          Malicious Compliance, until they disable this last effort at accessing their content.

        • anthk 3 hours ago

          For Reddit I just use it r/o under gopher://gopherddit.com

          A good client is either Lagrange (multiplatform), the old Lynx, or Dillo with the Gopher plugin.

        • appendix-rock 5 hours ago

          I don’t follow the logic here. There seems to be an implication of ulterior motive but I’m not seeing what it is. What aspect of ‘privacy’ offered by a VPN do you think that Reddit / LinkedIn are incentivised to bypass? From a privacy POV, your VPN is doing nothing to them, because your IP address means very little to them from a tracking POV. This is just FUD perpetuated by VPN advertising.

          However, the undeniable reality is that accessing the website with a non-residential IP is a very, very strong indicator of sinister behaviour. Anyone that’s been in a position to operate one of these services will tell you that. For every…let’s call them ‘privacy-conscious’ user, there are 10 (or more) nefarious actors that present largely the same way. It’s easy to forget this as a user.

          I’m all but certain that if Reddit or LinkedIn could differentiate, they would. But they can’t. That’s kinda the whole point.

          • bo1024 4 hours ago

            Not following what could be sinister about a GET request to a public website.

            > From a privacy POV, your VPN is doing nothing to them, because your IP address means very little to them from a tracking POV.

            I disagree. (1) Since I have JavaScript disabled, my IP address is generally their next best thing to go on. (2) I don't want to give them my IP address to correlate with the other data they have on me, because if they sell that data, someone else who only has my IP address can suddenly get a bunch of other stuff with it too.

            • zahllos 2 hours ago

              SQL injection?

              GET parameters can be abused like any other parameter. This could be SQL injection, directory traversal attempts, brute-force username attempts, you name it.

          • homebrewer 3 hours ago

            It's equally easy to forget about users from countries with way less freedom of speech and information sharing than in Western rich societies. These anti-abuse measures have made it much more difficult to access information blocked by my internet provider during the last few years. I'm relatively competent and can find ways around it, but my friends and relatives who pursue other career choices simply don't bother anymore.

            Telegram channels have been a good alternative, but even that is going downhill thanks to French authorities.

            Cloudflare and Google also often treat us like bots (endless captchas, etc) which makes it even more difficult.

          • afh1 4 hours ago

            IP address is a fingerprint to be shared with third parties, of course it's relevant. It's not an ulterior motive, it's explicit: not caring about your traffic because you're not a good product. They can and do differentiate by requiring a sign-in. They just don't care enough to make it actually work, because they are adtechs and not interested in you as a user.

      • anilakar 4 hours ago

        Heck, I cannot even pass ReCAPTCHA nowadays. No amount of clicking buses, bicycles, motorcycles, traffic lights, stairs, crosswalks, bridges and fire hydrants will suffice. The audio transcript feature is the only way to get past a prompt.

        • josteink 3 hours ago

          Just a heads up that this is how Google treats connections it suspects of originating from bots: silently keeping you in an endless loop, promising a reward if you complete it correctly.

          I discovered this when I set up IPv6 connectivity using Hurricane Electric as a tunnel broker.

          Seemingly Google has all HE.net IPv6 tunnel subnets listed for this behaviour, without it being documented anywhere. It was extremely annoying until I figured out what was going on.

          • n4r9 3 hours ago

            > Silently keeping you in an endless loop, promising a reward if you complete it correctly.

            Sounds suspiciously like how product managers talk to developers as well.

          • anilakar 2 hours ago

            Sadly my biggest crime is running Firefox with default privacy settings and uBlock Origin installed. No VPNs or IPv6 tunnels, no Tor traffic whatsoever, no Google search history poisoning plugins.

            If only there was a law that allowed one to be excluded from automatic behavior profiling...

      • orbisvicis 4 hours ago

        I have to solve captchas for Amazon while logged into my Amazon account.

      • amanda99 2 hours ago

        Yes and the most infuriating thing is the "we need to verify the security of your connection" text.

      • JohnFen 2 hours ago

        > when I hit a website like this I just close the tab and don't bother anymore.

        Yeah, that's my solution as well. I take those annoyances as the website telling me that they don't want me there, so I grant them their wish.

      • lioeters 5 hours ago

        Same here. I occasionally encounter websites that won't work with ad blockers, sometimes with Cloudflare involved, and I don't even bother with those sites anymore. Same with sites that display a cookie "consent" form without an option to not accept. I reject the entire site.

        Site owners probably don't even see these bounced visits, and it's such a tiny percentage of visitors who do this that it won't make a difference. Meh, it's just another annoyance for those of us trying to use the web on our own terms.

    • neilv 4 hours ago

      Similar here. It's not unusual to be blocked from a site by CloudFlare when I'm running Firefox (either ESR or current release) on Linux.

      I suspect that people operating Web sites have no idea how many legitimate users are blocked by CloudFlare.

      And, based on the responses I got when I contacted two of the companies whose sites were chronically blocked by CloudFlare for months, it seemed like it wasn't worth any employee's time to try to diagnose.

      Also, I'm frequently blocked by CloudFlare when running Tor Browser. Blocking by Tor exit node IP address (if that's what's happening) is much more understandable than blocking Firefox from a residential IP address, but still makes CloudFlare not a friend of people who want or need to use Tor.

      • jorams 2 hours ago

        > I suspect that people operating Web sites have no idea how many legitimate users are blocked by CloudFlare.

        I sometimes wonder if all Cloudflare employees are on some kind of whitelist that makes them not realize the ridiculous false positive rate of their bot detection.

      • lovethevoid 13 minutes ago

        What are some examples? I've been running ff on linux for quite some time now and am rarely blocked. I just run it with ublock origin.

      • pjc50 4 hours ago

        > CloudFlare not a friend of people who want or need to use Tor

        The adversarial aspect of all this is a problem: P(malicious|Tor) is much higher than P(malicious|!Tor)

      • amatecha 28 minutes ago

        Yeah, I've contacted numerous owners of personal/small sites and they are usually surprised, and never have any idea why I was blocked (not sure if it's an aspect of CF not revealing the reason, or the owner not knowing how to find that information). One or two allowlisted my IP but that doesn't strike me as a solution.

        I've contacted companies about this and they usually just tell me to use a different browser or computer, which is like "duh, really?", but that doesn't solve the problem for me or anyone else.

    • BiteCode_dev 9 hours ago

      Cloudflare is a fantastic service with an unmatched value proposition, but it's unfortunately slowly killing web privacy, with thousands of paper cuts.

      Another problem is that "resist fingerprinting" prevents some canvas processing, and many websites like Bluesky, LinkedIn or Substack use canvas to handle image uploads, so your images appear as stripes of pixels.

      Then you have mobile apps that just don't run if you don't have a google account, like chatgpt's native app.

      I understand why people give up, trying to fight for your privacy is an uphill battle with no end in sight.

      • pjc50 4 hours ago

        The privacy battle has to be at the legal layer. GDPR is far from perfect (bureaucratic and unclear with weak enforcement), but it's a step in the right direction.

        In an adversarial environment, especially with both AI scrapers and AI posters, websites have to be able to identify and ban persistent abusers. Which unfortunately implies having some kind of identification of everybody.

        • wbl an hour ago

          You notice that Analog Devices puts their (incredibly useful) information up for free. That's because they make money other ways. The ad-supported content-farm Internet had a nice run, but we will get on without it.

        • BiteCode_dev 3 hours ago

          That's another problem: we want cheap, easy solutions like tracking people, instead of more targeted or systemic ones.

        • nonameiguess 2 hours ago

          No, it's more than that. Cloudflare's bot protection has blocked me from sites where I have a paid account, paid for by my real checking account with my real name attached. Even when I am perfectly willing to give out my identity and be tracked, I still can't because I can't even get to the login page.

      • madeofpalk 7 hours ago

        > Then you have mobile apps that just don't run if you don't have a google account, like chatgpt's native app.

        Is that true? At least on iOS you can log into the ChatGPT app with the same email/password as the website.

        I never use Google login for stuff and ChatGPT works fine for me.

      • KomoD 7 hours ago

        > Then you have mobile apps that just don't run if you don't have a google account, like chatgpt's native app.

        That's not true, I use ChatGPT's app on my phone without logging into a Google account.

        You don't even need any kind of account at all to use it.

        • BiteCode_dev 7 hours ago

          On Android at least, even if you don't need to log in to your Google account when connecting to ChatGPT, the app won't work if your phone isn't signed in to Google Play, which doesn't work if your phone isn't linked to a Google account.

          An Android phone asks you to link a Google account when you use it for the first time. It takes a very dedicated user to refuse that, then to avoid logging in to the Gmail, YouTube or app store apps, which will all also link your phone to your Google account when you sign in.

          But I do actively avoid this: I use Aurora, F-Droid, K9 and NewPipeX, so no link to Google.

          But then no ChatGPT app. When I start it, I get hit with a login page for the app store and it's game over.

          • __MatrixMan__ 3 hours ago

            I have a similar experience with the pager duty app. It loads up and then exits with "security problem detected by app" because I've made it more secure by isolating it from Google (a competitor). Workaround is to just control it via slack instead.

            • BiteCode_dev 3 hours ago

              Well, you can use the web-based ChatGPT, so there is a workaround. It's just a worse experience.

          • acdha 4 hours ago

            So the requirement is to pass the phone’s system validation process rather than having a Google account. I don’t love that but I can understand why they don’t want to pay the bill for the otherwise ubiquitous bots, and it’s why it’s an Android-specific issue.

            • BiteCode_dev 3 hours ago

              You can make a very rational case for each privacy invasive technical decision ever made.

              In the end, the fact remains: no ChatGPT app without giving up your privacy, to Google no less.

              • acdha 2 hours ago

                “Giving up your privacy” is a pretty sweeping claim – it sounds like you’re saying that Android inherently leaks private data to Google, which is broader than even Apple fans tend to say.

                • michaelt 17 minutes ago

                  A person who was maximally distrustful of Google would assume they link your phone and your IP through the connection used to receive push notifications, and the wifi-network-visibility-to-location API, and the software update checker, and the DNS over HTTPS, and suchlike. As a US company, they could even be forced to do this in secret against their will, and lie about it.

                  Of course as Google doesn't claim they do this, many people would consider it unreasonably fearful/cynical.

                • BiteCode_dev 15 minutes ago

                  Google and Apple were both part of the PRISM program; of course I'm making this claim.

                  It's the opposite stance that would be bonkers.

          • ForHackernews 4 hours ago
            • BiteCode_dev 3 hours ago

              That won't make ChatGPT's app work though.

              • ForHackernews 2 hours ago

                It might well do, depending on what ChatGPT's app is asking the OS for. /e/OS is an Android fork that removes Google services and replaces them with open source stubs/re-implementations from https://microg.org/

                I haven't tried the ChatGPT app, but I know that, for example my bank and other financial services apps work with on-device fingerprint authentication and no Google account on /e/OS.

    • mzajc 5 hours ago

      I randomize my User-Agent header and many websites outright block me, most often with no captcha and not even a useless error message.

      The most egregious is Microsoft (just about every Microsoft service/page, really), where all you get is a "The request is blocked." and a few pointless identifiers listed at the bottom, purely because it thinks your browser is too old.

      CF's captcha page isn't any better either, usually putting me in an endless loop if it doesn't like my User-Agent.

      • lovethevoid 18 minutes ago

        Not sure a random UA extension is giving you much privacy. Try your results on the EFF's Cover Your Tracks and see: a random UA can still provide a lot of identifying information despite being randomized.

        From experience, a lot of the things people do in hopes of protecting their privacy only makes them far easier to profile.

      • charrondev 5 hours ago

        Are you sending an actual random string as your UA or sending one of a set of actual user agents?

        You’re best off just picking real ones. We got hit by a botnet sending 10k+ requests from 40 different ASNs and thousands of different IPs. The only way we were able to identify/block the traffic was by excluding user agents matching some regex (for whatever reason they weren’t spoofing real user agents, but weren’t sending actual ones either).
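
        For illustration, a minimal sketch of that kind of filter in Python; the patterns here are made up, and a real regex would be tuned against the traffic you actually see in your logs:

          import re

          # Hypothetical junk-UA patterns: empty strings, bare HTTP-library
          # defaults, or a single random-looking token with no spaces.
          SUSPICIOUS_UA = re.compile(
              r"^$"
              r"|^(python-requests|Go-http-client|libwww-perl)/"
              r"|^[A-Za-z0-9]{12,}$"
          )

          def looks_suspicious(user_agent):
              return bool(SUSPICIOUS_UA.search(user_agent or ""))

          print(looks_suspicious("kXf93hQbZt72pLm4"))  # True: one random token
          print(looks_suspicious("Mozilla/5.0 (X11; Linux x86_64; rv:128.0) "
                                 "Gecko/20100101 Firefox/128.0"))  # False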

        • RALaBarge 4 hours ago

          I worked at an anti-spam email security company in the aughts, and we had a Perl engine that would rip apart the MIME boundaries and measure everything: UA, SMTP client fingerprint headers, even the number of anchor or paragraph tags. A large combination of IF/OR evaluations with a regex engine did a pretty good job, since botnets usually don't bother to fully randomize or really opsec the payloads they send; it's a cannon instead of a flyswatter.

          • kccqzy 3 hours ago

            Similar techniques are known in the HTTP world too. There were things like detecting the order of HTTP request headers and matching them to known software, or even just comparing the actual content of the Accept header.
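
            A toy version of the header-order idea (the "known" orders below are simplified stand-ins, not real browser fingerprints):

              # Toy fingerprint check: real browsers send headers in a stable
              # order per engine, while naive HTTP libraries use their own.
              KNOWN_ORDERS = {
                  ("host", "user-agent", "accept", "accept-language",
                   "accept-encoding"): "firefox-like",
                  ("host", "connection", "user-agent", "accept",
                   "accept-encoding"): "chrome-like",
              }

              def classify(header_names):
                  key = tuple(name.lower() for name in header_names)
                  return KNOWN_ORDERS.get(key, "unknown client")

              print(classify(["Host", "User-Agent", "Accept",
                              "Accept-Language", "Accept-Encoding"]))  # firefox-like
              print(classify(["User-Agent", "Host", "Accept"]))  # unknown client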

        • mzajc 3 hours ago

          I use the Random User-Agent Switcher[1] extension on Firefox. It does pick real agents, but some of them might show a really outdated browser (eg. Firefox 5X), which I assume is the reason I'm getting blocked.

          [1]: https://addons.mozilla.org/en-US/firefox/addon/random_user_a...

      • pushcx 4 hours ago

        Rails is going to make this much worse for you. All new apps include naive agent sniffing and block anything “old” https://github.com/rails/rails/pull/50505

        • mzajc 3 hours ago

          This is horrifying. What happened to simply displaying a "Your browser is outdated, consider upgrading" banner on the website?

          • whoopdedo 38 minutes ago

            The irony being you can get around the block by pretending to be a bot.

            https://github.com/rails/rails/pull/52531

          • shbooms 2 hours ago

            idk, even that seems too much to me, but maybe I'm just being too sensitive.

            but like, why is it a website's job to tell me what browser version to use? Unless my outdated browser is lacking legitimate functionality which is required by your website, just serve the page and be done with it.

          • freedomben 2 hours ago

            Wow. And this is now happening right as I've blacklisted google-chrome due to manifest v3 removal :facepalm:

        • GoblinSlayer 4 hours ago

            def blocked?
              user_agent_version_reported? && unsupported_browser?
            end
          
          well, you know what to do here :)
    • DrillShopper 9 minutes ago

      Maybe after the courts break up Amazon the FTC can turn its eye to Cloudflare.

      • gjsman-1000 6 minutes ago

        A. Do you think courts give a darn about the 0.1% of users that are still using RSS? We might as well care about the 0.1% of users who want the ability to set every website's background color to purple with neon green anchor tags. RSS never caught on as a standard to begin with, peaking at 6% adoption by 2005.

        B. Cloudflare has healthy competition with AWS, Akamai, Fastly, Bunny.net, Mux, Google Cloud, Azure, you name it, there's a competitor. This isn't even an Apple vs Google situation.

    • Jazgot 8 hours ago

      My rss reader was blocked on kvraudio.com by cloudflare. This issue wasn't solved for months. I simply stopped reading anything on kvraudio. Thank you cloudflare!

    • anthk 3 hours ago

      Or any Dillo user, with a PSP User Agent which is legit for small displays.

    • pessimizer 3 hours ago

      Also, Cloudflare won't let you in if you forge your referer (it's nobody's business what site I'm coming from.) For years, you could just send the root of the site you were visiting, then last year somebody at Cloudflare flipped a switch and took a bite out of everyone's privacy. Now it's just endless reloading captchas.

      • zamadatix 3 hours ago

        Why go through that hassle instead of just removing the referer?

        • bityard 13 minutes ago

          Lots of sites see an empty referrer and send you to their main page or marketing page. Which means you can't get anywhere else on their site without a valid referrer. They consider it a form of "hotlink" protection.

          (I'm not saying I agree with it, just that it exists.)

    • anal_reactor 5 hours ago

      On my phone, Opera Mobile isn't allowed into some websites behind CloudFlare, most importantly 4chan.

      • dialup_sounds 3 hours ago

        4chan's CF config is so janky at this point it's the only site I have to use a VPN for.

    • jasonlotito 2 hours ago

      Cloudflare has always been a dumpster fire in usability. The number of times it would block me in that way was enough to make me seriously question the technical knowledge of anyone who used it. It's a dumpster fire. Friends don't let friends use Cloudflare. To me, it's like the Spirit Airlines of CDNs.

      Sure, tech-wise it might work great, but from your users' perspective it's trash.

    • viraptor 8 hours ago

      I know it's not a solution for you specifically here, but if anyone has access to the CF enterprise plan, they can report specific traffic as non-bot and hopefully improve the situation. They need to have access to the "Bot Management" feature though. It's a shitty situation, but some of us here can push back a little bit - so do it if you can.

      And yes, it's sad that the "make the internet work again" switch is behind an expensive paywall.

      • meeby 6 hours ago

        The issue here is that RSS readers are bots. Obviously perfectly sensible and useful bots, but they’re not “real people using a browser”. I doubt you could get RSS readers listed on Cloudflare’s “good bots” list either, which would let them past the default bot protection, given they’ll all run off random residential IPs.

        • j16sdiz 5 hours ago

          They can't whitelist by user agent, otherwise bots would pass just by spoofing the agent.

          If you have an enterprise plan, you can have custom rules, including allowing by URL.

        • sam345 4 hours ago

          Not sure if I get this. It seems to me an RSS reader is as much of a bot as a browser is for HTML. It just reads RSS rather than HTML.

          • kccqzy 3 hours ago

            The difference is that RSS readers usually do background fetches on their own rather than waiting for a human to navigate to a page. So in theory, you could just set up a crontab (or systemd timer) that simply xdg-opens various pages on a schedule and not be treated as a bot.
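
            A toy version of that idea, assuming a Linux desktop where xdg-open is available and something like cron or a systemd timer to invoke it (the URLs are placeholders):

              #!/usr/bin/env python3
              # Open each page in the default desktop browser, so the fetch
              # comes from a real browser session rather than a feed fetcher.
              import subprocess

              PAGES = [
                  "https://example.com/blog",
                  "https://example.org/news",
              ]

              for url in PAGES:
                  subprocess.run(["xdg-open", url], check=False)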

  • kevincox 16 hours ago

    I dislike advice of whitelisting specific readers by user-agent. Not only is this endless manual work that will only solve the problem for a subset of users but it also is easy to bypass by malicious actors. My recommendation would be to create a page rule that disables bot blocking for your feeds. This will fix the problem for all readers with no ongoing maintenance.

    If you are worried about DoS attacks that may hammer on your feeds then you can use the same configuration rule to ignore the query string for cache keys (if your feed doesn't use query strings) and overriding the caching settings if your server doesn't set the proper headers. This way Cloudflare will cache your feed and you can serve any number of visitors without putting load onto your origin.

    As for Cloudflare fixing the defaults, it seems unlikely to happen. It has been broken for years, Cloudflare's own blog is affected. They have been "actively working" on fixing it for at least 2 years according to their VP of product: https://news.ycombinator.com/item?id=33675847

    • benregenspan 4 hours ago

      AI crawlers have changed the picture significantly and in my opinion are a much bigger threat to the open web than Cloudflare. The training arms race has drastically increased bot traffic, and the value proposition behind that bot traffic has inverted. Previously many site operators could rely on the average automated request being net-beneficial to the site and its users (outside of scattered, time-limited DDoS attacks) but now most of these requests represent value extraction. Combine this with a seemingly related increase in high-volume bots that don't respect robots.txt and don't set a useful User-Agent, and using a heavy-handed firewall becomes a much easier business decision, even if it may target some desirable traffic (like valid RSS requests).

    • vaylian 9 hours ago

      I don't know if cloudflare offers it, but whitelisting the URL of the RSS feed would be much more effective than filtering user agents.

      • jks 2 hours ago

        Yes, you can do it with a "page rule", which the parent comment mentioned. The CloudFlare free tier has a budget of three page rules, which might mean that you have to bundle all your rss feeds in one folder so they share a path prefix.

      • derkades 9 hours ago

        Yes it supports it, and I think that's what the parent comment was all about

        • BiteCode_dev 9 hours ago

          Specifically, whitelisting the URL for the bot protection, but not the cache, so that you are still somewhat protected against adversarial use.

          • londons_explore 5 hours ago

            An adversary can easily send no-cache headers to bust the cache.

            • acdha 4 hours ago

              The CDN can choose whether to honor those. That hasn’t been an effective adversarial technique since the turn of the century.

              • londons_explore 2 hours ago

                does cloudflare give such an option? Even for non-paid accounts?

    • a-french-anon 6 hours ago

      And for those of us using sfeed, the default UA is Curl's.

  • jgrahamc 8 hours ago

    My email is jgc@cloudflare.com. I'd like to hear from the owners of RSS readers directly on what they are experiencing. Going to ask team to take a closer look.

    • kalib_tweli 7 hours ago

      There are email obfuscation and managed challenge script tags being injected into the RSS feed.

      You simply shouldn't have any challenges whatsoever on an RSS feed. They're literally meant to be read by a machine.

      • kalib_tweli 6 hours ago

        I confirmed that if you explicitly set the Content-Type response header to application/rss+xml it seems to work with Cloudflare Proxy enabled.

        The issue here is that Cloudflare's content type check is naive. And the fact that CF is checking the content-type header directly needs to be made more explicit OR they need to do a file type check.
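
        For example, if the feed comes from an app server, the fix can be a one-liner. A minimal Flask sketch (the framework and the inline XML are placeholders for however the feed is actually generated):

          from flask import Flask, Response

          app = Flask(__name__)

          FEED_XML = ('<?xml version="1.0" encoding="UTF-8"?>'
                      '<rss version="2.0"><channel><title>Example</title></channel></rss>')

          @app.route("/rss")
          def rss_feed():
              # Declare the payload as RSS explicitly instead of relying on a
              # default like text/html or text/xml that a proxy may treat as
              # ordinary page content.
              return Response(FEED_XML, mimetype="application/rss+xml")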

        • londons_explore 5 hours ago

          I wonder if popular software for generating RSS feeds might not be setting the correct content-type header? Maybe this whole issue could be mostly-fixed by a few github PR's...

          • onli 4 hours ago

            Correct might be debatable here as well. My blog for example sets Content-Type to text/xml, which is not exactly wrong for an RSS feed (after all, it is text and XML) and IIRC was the default back then.

            There were compatibility issues with other type headers, at least in the past.

            • johneth 2 hours ago

              I think the current correct content types are:

              'application/rss+xml' (for RSS)

              'application/atom+xml' (for Atom)

              • londons_explore 2 hours ago

                Sounds like a kind samaritan could write a scanner to find as many RSS feeds as possible which look like RSS/Atom and don't have these content types, then go and patch the hosting software those feeds use to have the correct content types, or ask the webmasters to fix it if they're home-made sites.

                As soon as a majority of sites use the correct types, clients can start requiring it for newly added feeds, which in turn will make webmasters make it right if they want their feed to work.
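
                A rough sketch of such a scanner (Python; the feed list is a placeholder for whatever crawl or directory it would come from):

                  import requests

                  EXPECTED = ("application/rss+xml", "application/atom+xml")
                  FEEDS = ["https://example.com/feed.xml"]  # placeholder list

                  for url in FEEDS:
                      try:
                          resp = requests.get(url, timeout=10)
                      except requests.RequestException as exc:
                          print(f"{url}: fetch failed ({exc})")
                          continue
                      ctype = resp.headers.get("Content-Type", "").split(";")[0].strip()
                      head = resp.text.lstrip()[:500]
                      if ("<rss" in head or "<feed" in head) and ctype not in EXPECTED:
                          print(f"{url}: looks like a feed but is served as "
                                f"{ctype or 'no Content-Type'}")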

          • djbusby 4 hours ago

            The number of feeds with crap headers and other non-spec stuff going on is huge, and loads of clients are missing useful headers too. Ugh. It seems like it should be simple; maybe that's why there are loads of naive implementations.

          • kalib_tweli 4 hours ago

            It wouldn't. It's the role of the HTTP server to set the correct content type header.

    • viraptor 8 hours ago

      It's cool and all that you're making an exception here, but how about including a "no, really, I'm actually a human" link on the block page rather than giving the visitor a puzzle: how do you report the issue to the page owner (hard on its own for normies) if you can't even load the page? This is just externalising issues that belong to the Cloudflare service.

      • jgrahamc 8 hours ago

        I am not trying to "make an exception", I'm asking for information external to Cloudflare so I can look at what people are experiencing and compare with what our systems are doing and figure out what needs to improve.

        • PaulRobinson 7 hours ago

          Some "bots" are legitimate. RSS is intended for machine consumption. You should not be blocking content intended for machine consumption because a machine is attempting to consume it. You should not expect a machine, consuming content intended for a machine, to do some sort of step to show they aren't a machine, because they are in fact a machine. There is a lot of content on the internet that is not used by humans, and so checking that humans are using it is an aggressive anti-pattern that ruins experiences for millions of people.

          It's not that hard. If the content being requested is RSS (or Atom, or some other syndication format intended for consumption by software), just don't do bot checks, use other mechanisms like rate limiting if you must stop abuse.

          As an example: would you put a captcha on robots.txt as well?

          As other stories here can attest to, Cloudflare is slowly killing off independent publishing on the web through poor product management decisions and technology implementations, and the fix seems pretty simple.

          • jamespo 4 hours ago

            From another post, if the content-type is correct it gets through. If this is the case I don't see the problem.

        • robertlagrant 7 hours ago
      • methou 8 hours ago

        Some clients are more like a bot/service; imagine Google Reader, which fetches and caches content for you. The client I'm currently using is Miniflux, and it also works in this way.

        I understand that there are some more interactive rss readers, but from personal experience it’s more like “hey I’m a good bot, let me in”

        • _Algernon_ 7 hours ago

          An RSS reader is a user agent (i.e. software acting on behalf of its users). If you define RSS readers as bots (even if they are good bots), you may as well call Firefox a bot (it also sends off web requests without explicit approval of each request by the user).

          • sofixa 7 hours ago

            Their point was that the RSS reader does the scraping on its own in the background, without user input. If it can't read the page, it can't; it's not initiated by the user where the user can click on a "I'm not a bot, I promise" button.

        • viraptor 7 hours ago

          It was a mental skip, but the same idea. It would be awesome if CF just allowed reporting issues at the point something gets blocked - regardless of whether it's a human or a bot. They're missing an "I'm misclassified" button for the people actually affected, without the third-party runaround.

          • fluidcruft 3 hours ago

            Unfortunately, I would expect that queue of reports to get flooded by bad faith actors.

    • badlibrarian 3 hours ago

      Thank you for showing up here and being open to feedback. But I have to ask: shouldn't Cloudflare be running and reviewing reports to catch this before it became such a problem? It's three clicks in Tableau for anyone who cares, and clearly nobody does. And this isn't the first time something like this has slipped through the cracks.

      I tried reaching out to Cloudflare with issues like this in the past. The response is dozens of employees hitting my LinkedIn page yet no responses to basic, reproducible technical issues.

      You need to fix this internally as it's a reputational problem now. Less screwing around using Salesforce as your private Twitter, more leadership in triage. Your devs obviously aren't motivated to fix this stuff independently and for whatever reason they keep breaking the web.

      • 015a 2 hours ago

        The reality that HackerNews denizens need to accept, in this case and in a more general form, is: RSS feeds are not popular. They aren't just unpopular in the way that, say, Peacock is unpopular relative to Netflix; they're truly unpopular, used regularly by a number of people that could fit in an American football stadium. There are younger software engineers at Cloudflare who have never heard the term "RSS" before, and have no notion of what it is. It will probably be dead technology in ten years.

        I'm not saying this to say its a good thing; it isn't.

        Here's something to consider though: Why are we going after Cloudflare for this? Isn't the website operator far, far more at-fault? They chose Cloudflare. They configure Cloudflare. They, in theory, publish an RSS feed, which is broken because of infrastructure decisions they made. You're going after Ryobi because you've got a leaky pipe. But beyond that: isn't this tool Cloudflare publishes doing exactly what the website operators intended it to do? It blocks non-human traffic. RSS clients are non-human traffic. Maybe the reason you don't want to go after the website operators is because you know you're in the wrong? Why can't these RSS clients detect when they encounter this situation, and prompt the user with a captive portal to get past it?

        • badlibrarian 2 hours ago

          I'm old enough to remember Dave Winer taking Feedburner to task for inserting crap into RSS feeds that broke his code.

          There will always be niche technologies and nascent standards and we're taking Cloudflare to task today because if they continue to stomp on them, we get nowhere.

          "Don't use Cloudflare" is an option, but we can demand both.

    • is_true 4 hours ago

      Maybe when you detect URLs that return the RSS mimetype, notify the owner of the site/CF account that it might be a good idea to allow bots on those URLs.

      Ideally you could make it a simple switch in the config, something like: "Allow automated access on RSS endpoints".

    • prmoustache 6 hours ago

      It is not only RSS reader users that are affected. Any user with an extension that blocks trackers regularly gets forbidden access to websites or has to deal with tons of captchas.

    • kevincox 4 hours ago

      I'll mail you as well but I think public discussion is helpful. Especially since I have seen similar responses to this over the years and it feels very disingenuous. The problem is very clear (Cloudflare serves 403 blocks to feed readers for no reason) and you have all of the logs. The solution is maybe not trivial but I fail to see how the perspective of someone seeing a 403 block is going to help much. This just starts to sound like a way to seem responsive without actually doing anything.

      From the feed reader perspective it is a 403 response. For example my reader has been trying to read https://blog.cloudflare.com/rss/ and the last successful response it got was on 2021-11-17. It has been backing off due to "errors" but it still is checking every 1-2 weeks and gets a 403 every time.

      This obviously isn't limited to the Cloudflare blog; I see it on many sites "protected by" (or in this case broken by) Cloudflare. I could tell you what public cloud IPs my reader comes from or which user-agent it uses, but that is beside the point. This is a URL which is clearly intended for bots, so it shouldn't be bot-blocked by default.

      When people reach out to customer support we tell them that this is a bug on the site's end and there isn't much we can do. They can try contacting the site owner, but this is most likely the default configuration of Cloudflare causing problems that the owner isn't aware of. I often recommend using a service like FeedBurner to proxy the request, as these services seem to be on the whitelist of Cloudflare and other scraping-prevention firewalls.

      I think the main solution would be to detect intended-for-robots content and exclude it from scraping prevention by default (at least to a huge degree).

      Another useful mechanism would be to allow these to be accessed when the target page is cachable, as the cache will protect the origin from overload-type DoS attacks anyways. Some care needs to be taken to ensure that adding a ?bust={random} query parameter can't break through to the origin but this would be a powerful tool for endpoints that need protection from overload but not against scraping (like RSS feeds). Unfortunately cache headers for feeds are far from universal, so this wouldn't fix all feeds on its own. (For example the Cloudflare blog's feed doesn't set any caching headers and is labeled as `cf-cache-status: DYNAMIC`.)
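
      For anyone who wants to check their own feed, the response headers tell the story (cf-cache-status is the header Cloudflare adds to proxied responses; DYNAMIC means the response isn't being cached, so every fetch hits the origin). A quick sketch:

        import requests

        resp = requests.get("https://blog.cloudflare.com/rss/", timeout=10)
        print("status:", resp.status_code)
        for header in ("cf-cache-status", "cache-control", "etag", "last-modified"):
            print(f"{header}: {resp.headers.get(header, '<not set>')}")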

  • erikrothoff 10 hours ago

    As the owner of an RSS reader I love that they are making this more public. 30% of our support requests are ”my feed doesn’t” work. It sucks that the only thing we can say is ”contact the site owner, it’s their firewall”. And to be fair it’s not only Cloudflare, so many different firewall setups cause issues. It’s ironic that a public API endpoint meant for bots is blocked for being a bot.

  • butz 2 hours ago

    Not "could": it is actually blocking. Very annoying when a government website does that, as it is usually next to impossible to explain the issue and ask for a fix. And even if the fix is made, it is reverted several weeks later. Other websites do that too; it was funny when one website asked an RSS reader to solve a captcha and prove it was human.

  • belkinpower 10 hours ago

    I maintain an RSS reader for work and Cloudflare is the bane of my existence. Tons of feeds will stop working at random and there’s nothing we can do about it except for individually contacting website owners and asking them to add an exception for their feed URL.

    • stanislavb 10 hours ago

      I was recently contacted by one of my website users as their RSS reader was blocked by Cloudflare.

    • sammy2255 10 hours ago

      Unfortunately it's not really Cloudflare but webadmins who have configured it to block everything that's not a browser, whether unknowingly or not.

      • afandian 9 hours ago

        If Cloudflare offer a product, for a particular purpose, that breaks existing conventions of that purpose, then it’s Cloudflare.

        • sammy2255 9 hours ago

          Not really. You wouldn't complain to a fence company for blocking a path if they were hired to do exactly that.

          • shakna 9 hours ago

            Yes, I would. Experts are expected to relay back to their client with their thoughts on a matter, not just blindly do as they're told. Your builder is meant to do their due diligence, which includes making recommendations.

          • gsich 9 hours ago

            They are enablers. They get part of the blame.

        • echoangle 9 hours ago

          Well it doesn’t break the conventions of the purpose they offer it for. Cloudflare attempts to block non-human users, and this is supposed to be used for human-readable websites. If someone puts cloudflare in front of a RSS feed, that’s user error. It’s like someone putting a captcha in front of an API and then complaining that the Captcha provider is breaking conventions.

  • elwebmaster an hour ago

    Using Cloudflare on your website could be blocking Safari users, Chrome users, or just any users. It’s totally broken. They have no way of measuring the false positives. Website owners are paying for it in lost revenue. And poor users who lose access for no fault of their own. Until some C-level exec at a BigTech randomly gets blocked and makes noise. But even then, Cloudflare will probably just whitelist that specific domain/IP. It is very interesting how I have never been blocked when trying to access Cloudflare itself, only blocked on their customer’s sites.

  • wraptile 5 hours ago

    Cloudflare has been the bane of my web existence on a Thai IP and a Linux Firefox fingerprint. I wonder how much traffic is lost because of Cloudflare, and of course none of that is reported to the web admins, so everyone continues in their jolly ignorance.

    I wrote my own RSS bridge that scrapes websites using the Scrapfly web scraping API, which bypasses all that, because it's so annoying that I can't even scrape some company's /blog that they are literally buying ads for but which somehow has an anti-bot enabled that blocks all RSS readers.

    The modern web is so antisocial that the web 2.0 guys should be rolling in their "everything will be connected with APIs" graves by now.

    • vundercind 2 hours ago

      The late '90s-'00s solution was to blackhole address blocks associated with entire countries or continents. It was easily worth it for many US sites that weren't super-huge to lose the 0.1% of legitimate requests they'd get from, say, China or Thailand or Russia, to cut the speed their logs scrolled at by 99%.

      The state of the art isn't much better today, it seems. Similar outcome with more steps.

  • whs 8 hours ago

    My company runs a tech news website. We offer an RSS feed as any Drupal website would, and content farms just scrape our RSS feed to rehost our content in full. This is usually fine for us - the content is CC-licensed and they do post the correct source. But they run thousands of different WordPress instances on the same IP, and they individually fetch the feed.

    In the end we had to use Cloudflare to rate limit the RSS endpoint.

    • kevincox 4 hours ago

      > In the end we had to use Cloudflare to rate limit the RSS endpoint.

      I think this is fine. You are solving a specific problem and still allowing some traffic. The problem with the Cloudflare default settings is that they block all requests leading to users failing to get any updates even when fetching the feed at a reasonable rate.

      BTW, in this case another solution may just be to configure proper caching headers. Even if you only cache for 5 minutes at a time, that will be at most 1 request every 5 minutes per Cloudflare caching location (I don't know the exact configuration, but they typically use ~5 locations per origin, so that would be only 1 req/min, which is a trivial load and will handle both these inconsiderate scrapers and regular users. You can also configure all fetches to come from a single location, and then you would only need to actually serve the feed once per 5 minutes).
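
      A sketch of what the origin could send, assuming a Flask app (any server or framework works; the path and 5-minute TTL are placeholders that mirror the numbers above):

        from flask import Flask, Response

        app = Flask(__name__)

        def render_feed():
            # Placeholder for however the feed XML is actually generated.
            return '<?xml version="1.0"?><rss version="2.0"><channel/></rss>'

        @app.route("/feed.xml")
        def feed():
            resp = Response(render_feed(), mimetype="application/rss+xml")
            # "public" plus a short max-age lets Cloudflare (or any shared
            # cache) serve the feed from the edge for ~5 minutes instead of
            # hitting the origin for every reader and scraper.
            resp.headers["Cache-Control"] = "public, max-age=300, s-maxage=300"
            return resp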

    • yjftsjthsd-h 2 hours ago

      > In the end we had to use Cloudflare to rate limit the RSS endpoint.

      Isn't the correct solution to use CF to cache RSS endpoints aggressively?

  • MarvinYork 8 hours ago

    In any case, it blocks German Telekom users. There is an ongoing dispute between Cloudflare and Telekom as to who pays for the traffic costs. Telekom is therefore throttling connections to Cloudflare. This is the reason why we can no longer use Cloudflare.

    • SSLy 5 hours ago

      as much as I am not a fan of cloudflare's practices, in this particular case DTAG seems to be the party at fault.

  • hugoromano 2 hours ago

    "could be blocking RSS users" it says it all "could". I use RSS on my websites, which are serviced by Cloudflare, and my users are not blocked. For that, fine-tuning and setting Configuration Rules at Cloudflare Dashboard are required. Anyone on a free has access to 10 Configuration Rules. I prefer using Cloudflare Workers to tune better, but there is a cost. My suggestion for RSS these days is to reduce the info on RSS feed to teasers, AI bots are using RSS to circumvent bans, and continue to scrape.

  • srmarm 2 hours ago

    I'd have thought the website owner whitelisting their RSS feed URI (or pattern matching *.xml/*.rss) might be better than doing it based on the user agent string. For one, you'd expect bot traffic on these endpoints, and you're also not leaving a door open to anyone who fakes their user agent.

    Looks like it should be possible under the WAF

  • veeti 10 hours ago

    I believe that disabling "Bot Fight Mode" is not enough, you may also need to create a rule to disable "Browser Integrity Check".

  • mbo 9 hours ago

    This is an active issue with Rate Your Music right now: https://rateyourmusic.com/rymzilla/view?id=6108

    Unfixed for 4 months.

  • artooro 2 hours ago

    This is a truly problematic issue that I've experienced as well. The best solution is probably for Cloudflare to figure out what normal RSS usage looks like and have a provision for that in their bot detection.

  • 015a 2 hours ago

    Suggesting that website operators should allowlist RSS clients through the Cloudflare bot detection system via their user-agent is a rather concerning recommendation.

  • idunnoman1222 2 hours ago

    Yes, the way to retain your privacy is to not use the Internet

    if you don’t like it, make your own Internet: assumedly one not funded by ads

  • ricardo81 9 hours ago

    iirc even if you're listed as a "good bot" with Cloudflare, high security settings by the CF user can still result in 403s.

    No idea if CF already does this, but allowing users to generate access tokens for 3rd party services would be another way of easing access alongside their apparent URL and IP whitelisting.

  • account42 6 hours ago

    Or just normal human users with a niche browser like Firefox.

  • pointlessone 7 hours ago

    I see this on a regular basis. My self-hosted RSS reader is blocked by Cloudflare even after my IP address was explicitly allowlisted by a few feed owners.

  • nfriedly 3 hours ago

    Liliputing.com had this problem a couple of years ago. I emailed the author and he got it sorted out after a bit of back and forth.

  • prmoustache 8 hours ago

    I believe this also poses issues for people running adblockers. I get tons of repetitive captchas on some websites.

    Also, other companies offering similar services, like Imperva, seem to be straight-up banning my IP after one visit to a website with uBlock Origin: I first get a captcha, then a page saying I am not allowed, and whatever I do, even using an extensionless Chrome browser with a new profile, I can't visit it anymore because my IP is banned.

    • acdha 4 hours ago

      One thing to keep in mind is that the modern web sees a lot of spam and scraping, and ad revenue has been sliding for years. If you make your activity look like a bot, most operators will assume you’re not generating revenue and block you. It sucks, but thank a spammer for the situation.

  • rcarmo 10 hours ago

    Ironically, the site seems to currently be hugged to death, so maybe they should consider using Cloudflare to deal with HN traffic?

    • sofixa 7 hours ago

      Doesn't have to be using CloudFlare, just a static web host that will be able to scale to infinity (of which CloudFlare is one with Pages, but there's also Google with Firebase Hosting, AWS with Amplify, Microsoft with something in Azure with a verbose name, Netlify, Vercel, GitHub Pages, etc etc etc).

      • kawsper 4 hours ago

        Or just add Varnish or Nginx configured with a cache in front.

        • vundercind 2 hours ago

          I used to serve low-tens-of-MB .zip files—worse than a web page and a few images or what have you—statically from Apache2 on a boring Linux server that'd qualify as potato-tier today, with traffic spikes into the hundreds of thousands per minute. Tens of thousands per minute against other endpoints gated by PHP setting a header to tell Apache2 to serve the file directly if the client authenticated correctly, and I think that one could have gone a lot higher, never really gave it a workout. Wasn't even really taxing the hardware that much for either workload.

          Before that, it was on a mediocre-even-at-the-time dedicated-cores VM. That caused performance problems... because its Internet "pipe" was straw-sized, it turned out. The server itself was fine.

          Web server performance has regressed amazingly badly in the world of the Cloud. Even "serious" sites have decided the performance equivalent of shitty shared-host Web hosting is a great idea and that introducing all the problems of distributed computing at the architecture level will help their moderate-traffic site work better (LOL; LMFAO), so now they need Cloudflare and such just so their "scalable" solution doesn't fall over in a light breeze.

        • sofixa 4 hours ago

          That can still exhaust system resources on the box it's running on (file descriptors, inodes, ports, CPU/memory/bandwidth, etc) if you hit it too big.

          For something like entirely static content, it's so much easier (and cheaper, all of the static hosting providers have an extremely generous free tier) to use static hosting.

          And I say this as an SRE by heart who runs Kubernetes and Nomad for fun across a number of nodes at home and in various providers - my blog is on a static host. Use the appropriate solution for each task.

    • timeon 8 hours ago

      If it is unintentional DDoS, we can wait. Not everything needs to be on demand.

      • dewey 8 hours ago

        The website is built to get attention, the attention is here right now. Nobody will remember to go back tomorrow and read the site again when it’s available.

        • BlueTemplar 3 hours ago

          I'm not sure an open web can exist under this kind of assumption...

          Once you start chasing views, it's going to come at the detriment of everything else.

          • dewey 3 hours ago

            This happened at least 15 years ago and we are doing okay.

  • est 7 hours ago

    Hmmm, that's why "feedburner" is^H^Hwas a thing, right?

    We have come full circle.

    • kevincox 4 hours ago

      Yeah, this is the recommendation that I usually give people who reach out to support. Feedburner tends to be on the whitelists, which avoids this problem.

  • soraminazuki 8 hours ago

    This is an issue with techdirt.com. I contacted them about this through their feedback form a long time ago, but the issue still remains unfortunately.

  • hwj 8 hours ago

    I had problems accessing Cloudflare-hosted websites via the Tor browser also. Don't know if that is still true.

  • timnetworks 3 hours ago

    RSS is the future that has been kept from us for twenty years already; fusion can kick bricks.

  • dewey 8 hours ago

    I’m using Miniflix and I always run into that on a few blogs which now I just stopped reading.

  • shaunpud 4 hours ago

    Namesilo are the same: their CSV/RSS is behind Cloudflare, so I don't even bother anymore with their auctions, and their own interface is meh.

  • anilakar 4 hours ago

    ...and there is a good number of people who see this as a feature, not a bug.

  • hkt 4 hours ago

    It also manages to break IRC bots that do things like show the contents of the title tag when someone posts a link. Another cloudy annoyance, albeit a minor one.