- SSL/TLS: You will likely lose your Cloudflare-provided SSL certificate. Your site will only work over HTTPS if your origin server has its own valid certificate.
- Security & Performance: You will lose the performance benefits (caching, minification, global edge network) and security protections (DDoS mitigation, WAF) that Cloudflare provides.
- This will also reveal your backend/origin IP addresses. Anyone can find permanent logs of the public IP addresses used by even obscure domain names, so potential adversaries don't necessarily have to be paying attention at exactly the right time to find them.
If anyone needs the internet to work again (or needs to get into your Cloudflare dashboard to generate API keys): if you have Cloudflare WARP installed, turning it on appears to fix otherwise broken sites. Maybe using 1.1.1.1 does too, but flipping the toggle was faster. Some parts of sites are still down, even after tunneling in to CF.
It's absurdly slow (multiple minutes for the login page to fully load and the login button to become pressable, due to the captcha...), but I was able to log into the dashboard. It throws lots of errors once inside, but I can navigate around some of it. YMMV.
My profile pages (including API tokens) and the websites pages all work; the Accounts tab above Websites on the left does not.
A colleague of mine just came bursting through my office door in a panic, thinking he brought our site down since this happened just as he made some changes to our Cloudflare config. He was pretty relieved to see this post.
You joke and I think it's funny, but as a junior engineer I would be quite proud if some small change I made was able to take down the mighty Cloudflare.
If I were Cloudflare, it would mean an immediate job offer well above market. That junior engineer is either a genius, or so lucky that they must have been bred by Pierson’s Puppeteers, or such a perfect manifestation of a human fuzzer that their skills must be utilized.
This reminds me of a friend I had in college. We were assigned to the same group coding an advanced calculator in C. This guy didn't know anything about programming (he was mostly focused on his side biz of selling collector sneakers), so we assigned him to do all the testing; his job was to come up with weird equations and weird-but-valid ways to present them to the calculator. And this dude somehow managed to crash almost all of our iterations except the last few. Really put the joke about a programmer, a tester, and a customer walking into a bar into perspective.
I love that he ended up making a very valuable contribution despite not knowing how to program -- other groups would have just been mad at him, had him do nothing, or had him do programming and gotten mad when it was crap or not finished.
I think the rate limits for Claude Code on the Web include VM time in general and not just LLM tokens. I have a desktop app with a full end to end testing suite which the agent would run for every session that probably burned up quite a bit.
I kind of did that back in the day when they released Workers KV: I tried to bulk upload a lot of data and it brought the whole service down. Can confirm, I was proud :D
It's also not exactly the least common way that this sort of huge multi-tenant service goes down. It's only as rare as it is because more or less all of them have had such outages in the past and built generic defenses (e.g. automated testing of customer changes, gradual rollout, automatic rollback, there are others but those are the ones that don't require any further explanation).
>You joke and I think it's funny, but as a junior engineer I would be quite proud if some small change I made was able to take down the mighty Cloudflare.
With Cloudflare's recent (lack of) uptime, I would argue there's a degree of crashflation happening, such that there's less prestige in doing so. I mean, nowadays if a lawnmower drives by Cloudflare and backfires, that's enough to collapse the whole damn thing.
Well, it's easy to cause damage by messing up the `rm` command, especially with the `-fr` options. So don't take the ability to cause damage as a proxy for some great skill.
You could easily cause great damage to your Cloudflare setup, but CF has measures to prevent random customers deleting stuff from taking down the entire service globally. Unless you have admin access to the entire CF system, you can't really cause much damage with rm.
It's also what caused the Azure Front Door global outage two weeks ago - https://aka.ms/air/YKYN-BWZ
"A specific sequence of customer configuration changes, performed across two different control plane build versions, resulted in incompatible customer configuration metadata being generated. These customer configuration changes themselves were valid and non-malicious – however they produced metadata that, when deployed to edge site servers, exposed a latent bug in the data plane. This incompatibility triggered a crash during asynchronous processing within the data plane service. This defect escaped detection due to a gap in our pre-production validation, since not all features are validated across different control plane build versions."
> May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.
I'd love to know more about what those specific circumstances were!
I'm pretty sure I crashed Gmail using something weird in its filters. It was a few years ago. Every time I did something specific (I don't remember what), it would freeze and then display a 502 error for a while.
What do you imagine would be the result if you brought down cloudflare with a legitimate config update (ie not specifically crafted to trigger known bugs) while not even working for them? If I were the customer "responsible" for this outage, I'd just be annoyed that their software is apparently so fragile.
I would be fine if it was my "fault", but I'm sure people in business would find a way to make me suffer.
But on a personal level, this is like ordering something at a restaurant and the cook burning down the kitchen because they forgot to take your pizza out of the oven or something.
I would be telling it to everyone over beers (but not my boss).
What’s funny is that as I get older, this feeling of relief turns more into a feeling of dread. The nice thing about problems that you cause is that you have considerable autonomy to fix them. When Cloudflare goes down, you’re sitting and waiting for a third party to fix something.
Can’t speak for GP but ultimately I’d rather it be my fault or my company’s fault so I have something I can directly do for my customers who can’t use our software. The sense of dread isn’t about failure but feeling empathy for others who might not make payroll on time or whatever because my service that they rely on is down. And the second order effects, like some employee of a customer being unable to make rent or be forced to take out a short term loan or whatever. The fallout from something like this can have an unexpected human cost at times. Thankfully it’s Tuesday, not a critical payroll day for most employees.
But why does this case specifically matter? What if their system was down due to their WiFi or other layers beyond your software? Would you feel the same as well?
What about all the other systems and people suffering elsewhere in the World?
I don't understand what point you're trying to make. Are you suggesting that if I can't feel empathy for everybody at once, or in every one of their circumstances, that I should not feel anything at all for anyone? That's not how anything works. Life (or, as I believe, God) brings us into contact with all kinds of people experiencing different levels of joy and pain. It's natural to empathize with the people you're around, whatever they're feeling. Don't over-complicate it.
So you would rather be incompetent than powerless? Choice of third party vendor on client facing services is still on you, so maybe you prefer your incompetence be more direct and tangible?
Even still, you should have policies in place to mitigate such eventualities; that way you can focus the incompetence into systemic issues instead. The larger the company, the less acceptable these failures become. Lessons learned is a better excuse for a shake-and-break startup than for an established player that can pay to be secure.
At some point, the finger has to be pointed. Personally, I don't dread it pointing elsewhere. Just means I've done my due D and C.
If customers expected third-party downtime not to affect your product, then you shouldn't have picked a third-party provider, or you should have spent extra resources on not having a single point of failure. If they were happy choosing the third party, knowing they'd depend on said provider, then it was an accepted risk.
The problem is, I still get the wrong end of the stick when AWS or CF go down! Management doesn't care, understandably. They just want the money to keep coming in. It's hard to convince them that this is a pretty big problem. The only thing that will calm them down a bit is to tell them Twitter is also down. If that doesn't get them, I say ChatGPT is also down. Now NOBODY will get any work done! lol.
This is why you ALWAYS have a proposal ready. I literally had my ass saved by having tickets with reliability/redundancy work clearly laid out, complete with comments from out-of-touch product/people managers deprioritizing the work after attempts to pull it off the backlog (in one infamous case, for a notoriously poorly conceived and expensive failure of a project that haunted us again with lost opportunity cost).
The hilarious part of the whole story is that the same PMs and product managers were (and I cannot emphasize this enough) absolutely militant orthodox agile practitioners with Jira.
Every time a major cloud goes down, management tells us why don't we have a backup service that we can switch to. Then I tell them that a bunch of services worth a lot more than us are also down. Do you really want to spend the insane amount of resources to make sure our service stays up when the global internet is down?
Who decided to go with AWS or CF? If it's a management decision, tell them you need the resources to have a fallback if they want their system to be more reliable than AWS or CF.
Haha yeah I just got off the phone and I said, look, either this gets fixed soon or there's going to be news headlines with photographs of giant queues of people milling around in airports.
Maybe "Erleichterung" (relief)? But as a German "Schadenserleichterung" (also: notice the "s" between both compound word parts) rather sounds like a reduction of damage (since "Erleichterung" also means mitigation or alleviation).
Right, I thought of that at first and discarded it for that reason. The real problem is that in the usual "bit of German language how-to" story of how Schadenfreude works, the component that it's other people's damage sparking the joy is missing from the word itself; that interpretation has to already be known by the person using the word. If you were just coining the word and nobody in the world had heard it before, it would be pretty reasonable for people to think you had created a new word for masochism.
When I'm debugging something, I'm not usually looking for the solution to the problem; I'm looking for sufficient evidence that I didn't cause the problem. Once I have that, the velocity at which I work slows down
Maybe this isn’t great, but I get a hint of that feeling when I’m on an airplane and hear a baby crying. For a number of years, if I heard a baby crying, it was probably my baby and I had to deal with it. But now my kids are past that phase, so when I hear the crying, after that initial jolt of panic I realize that it isn’t my problem, and that does give me the warm fuzzies. Even though I do feel bad for the baby and their parents.
Related situation: you're at a family gathering and everyone has young kids running around. You hear a thump, and then some kid starts screaming. Conversation stops and every parent keenly listens to the screams to try and figure out whose kid just got hurt, then some other parent jumps up - it's not your kid! #phewphoria
Not quite, that’s more like taking pleasure in the misfortune of someone else. It's close, but the specific bit of relief that it is not _your_ misfortune isn't captured.
I woke up getting bombarded by messages from multiple clients about sites not working. I shit my pants because I had changed the config just yesterday. When I saw the status message "Cloudflare down" I was so relieved.
Good that he worked it out so quick. I recently spent a day debugging email problems on Railway PaaS, because they silently closed an SMTP port without telling anyone.
You missed a great opportunity to dead-pan him with something like "No, Bob, not just our site, you brought down the entire Internet, look at this post!"
> In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack.
It still astounds me that the big dogs still do not phase config rollouts. Code is data, configs are data; they are one and the same. It was the same issue with the giant CrowdStrike outage last year: they were rawdogging configs globally, a bad config made it out there, and everything went kaboom.
You NEED to phase config rollouts like you phase code rollouts.
The big dogs absolutely do phase config rollouts as a general rule.
There are still two weaknesses:
1) Some configs are inherently global and cannot be phased. There's only one place to set them. E.g. if you run a webapp, this would be configs for the load balancer as opposed to configs for each webserver
2) Some configs have a cascading effect -- even though a config is applied to 1% of servers, it affects the other servers they interact with, and a bad thing spreads across the entire network
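For anyone who hasn't built one of these, the basic shape is simple. A minimal sketch in Python, purely illustrative and not how Cloudflare (or anyone in particular) actually deploys; `push_config` and `healthy` are hypothetical stand-ins for your deploy and monitoring hooks:

```python
import time

FLEET = [f"edge-{i:03d}" for i in range(200)]   # pretend server fleet
PHASES = [0.01, 0.05, 0.25, 1.0]                # 1% canary, then widen

def phased_rollout(new_cfg, old_cfg, push_config, healthy):
    """Push new_cfg in stages; halt and roll back if any host goes unhealthy."""
    done = 0
    for fraction in PHASES:
        target = int(len(FLEET) * fraction)
        for host in FLEET[done:target]:
            push_config(host, new_cfg)
        done = target
        time.sleep(300)                          # let crashes / error rates surface
        unhealthy = [h for h in FLEET[:done] if not healthy(h)]
        if unhealthy:
            for host in FLEET[:done]:            # automatic rollback to last known good
                push_config(host, old_cfg)
            raise RuntimeError(f"rollout halted, unhealthy hosts: {unhealthy[:5]}")
```

The catch is exactly the two weaknesses above: this only helps when the blast radius of a bad config is limited to the hosts it has actually reached. A truly global setting, or one whose effects cascade to hosts that never received it, defeats the whole scheme.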
> Some configs are inherently global and cannot be phased
This is also why "it is always DNS". It's not that DNS itself is particularly unreliable, but rather that it is the one area where you can really screw up a whole system by running a single command, even if everything else is insanely redundant.
Sure, but that doesn't really help for user-facing services where people expect to either type a domain name in their browser or click on a search result, and end up on your website every time.
And the access controls of DNS services are often (but not always) not fine-grained enough to actually prevent someone from ignoring the procedure and changing every single subdomain at once.
> Sure, but that doesn't really help for user-facing services where people expect to either type a domain name in their browser or click on a search result, and end up on your website every time.
It does help. For example, at my company we have two public endpoints:
company-staging.com
company.com
We roll out changes to company-staging.com first and have smoke tests which hit that endpoint. If the smoke tests fail, we stop the rollout to company.com.
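For what it's worth, the gating step doesn't need to be fancy. A rough Python sketch of the idea, not our actual pipeline; the /health path and the deploy command are made-up placeholders:

```python
import subprocess
import sys
import urllib.request

STAGING = "https://company-staging.com"
PROD_DEPLOY = ["./deploy.sh", "company.com"]   # hypothetical rollout command

def smoke_test(base_url: str) -> bool:
    """Hit a simple health endpoint; anything but a clean 200 fails the test."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=10) as resp:
            return resp.status == 200
    except Exception:
        return False

if not smoke_test(STAGING):
    sys.exit("smoke tests failed against staging, halting rollout to company.com")
subprocess.run(PROD_DEPLOY, check=True)
```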
That doesn’t help with rolling out updates to the DNS for company.com which is the point here. It’s always DNS because your pre-production smoke tests can’t test your production DNS configuration.
But users are going to example.com. Not my-service-33.example.com.
So if you've got some configuration that has a problem that only appears at the root-level domain, no amount of subdomain testing is going to catch it.
I think it's uncharitable to jump to the conclusion that just because there was a config-based outage they don't do phased config rollouts. And even more uncharitable to compare them to crowdstrike.
I have read several cloudflare postmortems and my confidence in their systems is pretty low. They used to run their entire control plane out of a single datacenter which is amateur hour for a tech company that has over $60 billion in market cap.
I also don’t understand how it is uncharitable to compare them to crowdstrike as both companies run critical systems that affect a large number of people’s lives, and both companies seem to have outages at a similar rate (if anything, cloudflare breaks more often than crowdstrike).
> The larger-than-expected feature file was then propagated to all the machines that make up our network
> As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
I was right. Global config rollout with bad data. Basically the same failure mode as CrowdStrike.
It seems fairly logical to me? If a config change causes services to crash then the rollout stops … at least in every phased rollout system I’ve ever built…
In a company I am no longer with, I argued much the same when we rolled out "global CI/CD" on IaC. You made one change, committed and pushed, and wham, it was on 40+ server clusters globally. I hated it. The principal was enamored with it, "cattle not pets" and all that, but the result was that things slowed down considerably because anyone working with it became so terrified of making big changes.
Because adversaries adapt quickly, they have a system that deploys their counter-adversary bits quickly without phasing - no matter whether they call them code or configs. See also: Crowdstrike.
Configuration changes are dangerous for CF it seems, and knocked down $NET almost 4% today. I wonder what the industry wide impact is for each of these outages?
>Configuration changes are dangerous for CF it seems, and knocked down $NET almost 4% today. I wonder what the industry wide impact is for each of these outages?
This is becoming the "new normal." It seems like every few months, there's another "outage" that takes down vast swathes of internet properties, since they're all dependent on a few platforms and those platforms are, clearly, poorly run.
This isn't rocket surgery here. Strong change management, QA processes and active business continuity planning/infrastructure would likely have caught this (or not), as is clear from other large platforms that we don't even think about because outages are so rare.
Like airline reservations systems[0], credit card authorization systems from VISA/MasterCard, American Express, etc.
Those systems (and others) have outages in the "once a decade" or even much, much, longer ranges. Are the folks over at SABRE and American Express that much smarter and better than Cloudflare/AWS/Google Cloud/etc.? No. Not even close. What they are is careful as they know their business is dependent on making sure their customers can use their services anytime/anywhere, without issue.
It amazes me the level of "Stockholm Syndrome"[1] expressed by many posting to this thread, expressing relief that it wasn't "an attack" and essentially blaming themselves for not having the right tools (API keys, etc.) to recover from the gross incompetence of, this time at least, Cloudflare.
I don't doubt that I'll get lots of push back from folks claiming, "it's hard to do things at scale," and/or "there are way too many moving parts," and the like.
Other organizations like the ones I mention above don't screw their customers every 4-6 months with (clearly) insufficiently tested configuration and infrastructure changes.
Yet many here seem to think that's fine, even though such outages are often crushing to their businesses. But if the customers of these huge providers don't demand better, they'll only get worse. And that's not (at least in my experience) a very deep or profound idea.
Pretty much everything is down (checking from the Netherlands). The Cloudflare dashboard itself is experiencing an outage as well.
Not-so-funny thing is that the Betterstack dashboard is down but our status page hosted by Betterstack is up, and we can't access the dashboard to create an incident and let our customers know what's going on.
Yep that's also my experience. Except HN because it does not use *** Cloudflare because it knows it is not necessary. I just wrote a blog titled "Do Not Put Your Site Behind Cloudflare if You Don't Need To" [1].
No, since there are simply too many. For an e-commerce site I work for, we once had an issue where some bad actor tried to crawl the site to set up scam shops. The list of IPs was way too broad, and the user agents way too generic or random.
Could you not also use an ASN list like https://github.com/brianhama/bad-asn-list and add blocks of IPs to a blocklist (eg. ipset on Linux)? Most of the scripty traffic comes from VPSs.
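To sketch what I mean: expand each ASN into its announced prefixes, then feed those into an ipset. The RIPEstat endpoint and the `badasn` set name here are my assumptions, so check current docs before relying on it:

```python
import json
import sys
import urllib.request

# RIPEstat "announced-prefixes" endpoint (assumed; verify against their docs)
RIPESTAT = "https://stat.ripe.net/data/announced-prefixes/data.json?resource=AS{asn}"

def prefixes_for(asn: str):
    with urllib.request.urlopen(RIPESTAT.format(asn=asn), timeout=30) as resp:
        data = json.load(resp)
    return [p["prefix"] for p in data["data"]["prefixes"]]

# Usage: python asn_block.py AS14061 AS16276 ... | sudo sh
print("ipset create badasn hash:net -exist")
for asn in sys.argv[1:]:
    for prefix in prefixes_for(asn.lstrip("ASas")):
        if ":" in prefix:
            continue                    # IPv6 needs a separate family inet6 set
        print(f"ipset add badasn {prefix} -exist")
print("iptables -I INPUT -m set --match-set badasn src -j DROP")
```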
Thanks to widespread botnets, most scrapers fall back to using "residential proxies" the moment you block their cloud addresses. Same load, but now you risk accidentally blocking customers coming from similar net blocks.
Blocking ASNs is one step of the fight, but unfortunately it's not the solution.
Hypothetically, as a cyber-criminal, I'd like to thank the blacklist industry for bringing so much money into criminal enterprises by making residential proxies mandatory for all scraping.
It's not one IP to block. It's thousands! And they're scattered across different IP networks, so no simple CIDR block is possible. Oh, and just for fun, when you block their datacenter IPs they switch to hundreds of residential network IPs.
Yes, they are really hard to block. In the end I switched to Cloudflare just so they can handle this mess.
Wouldn't it be trivial to just write a UFW rule to block the crawler IPs?
Probably more effective would be to get the bots to exclude your IP/domain. I do this for SSH, leaving it open on my public SFTP servers on purpose. [1] If I can get 5 bot owners to exclude me, that could be upwards of 250k+ nodes, mostly mobile IPs, that stop talking to me. Just create something that confuses and craps up the bots. With SSH bots this is trivial, as most SSH bot libraries and code are unmaintained and poorly written to begin with. In my SSH example, look for the VersionAddendum. Old versions of ssh, old ssh libraries, and code that tries to implement SSH itself will choke on a long banner string. Not to be confused with the text banner file.
I'm sure the clever people here could make something similar for HTTPS and especially for GPT/LLM bots at the risk of being flagged "malicious".
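To make the VersionAddendum bit concrete, it's a one-line change in sshd_config. Sketch only; the padding is arbitrary, and how well it works depends entirely on how sloppily the bot's SSH code parses the identification string (RFC 4253 caps it at 255 characters, which is exactly the assumption naive parsers bake in):

```
# /etc/ssh/sshd_config
# Appended to the SSH protocol banner; bot code that reads the
# identification string into a small fixed-size buffer tends to choke here.
VersionAddendum XXXX...a few hundred characters of padding...XXXX
```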
Belated response as I called it a night over here in sunny Australia!
The image scraping bots are training for generative AI, I'm assuming.
As to why they literally scrape the same images hundreds of thousands of times?
I have no idea!
But I am not special, the bots have been doing it across the internet.
My main difference from other sites is that I operate a tourism-focused SaaS for local organisations and government tourist boards. Which means we have a very healthy number of images being served per page across our sites.
We also do on the fly transformations for responsive images and formats.
Which is all done through Cloudinary.
The Bytespider bot (Bytedance / TikTok) was the one that was being abusive for me.
Bad actors now have access to tens of thousands of IPs and servers on the fly.
The cost of hardware and software resources these days is absolute peanuts compared to 10 years ago. Cloud services and APIs have also made managing them trivial as hell.
Cloudflare is simply an evolution in response to the other side, both legitimate and illegitimate users, also having evolved greatly.
Yes, I never understand this obsession with centralized services like Cloudflare. To be fair though, if our tiny blogs only get a hundred or so visitors monthly anyway, does it matter if they have an outage for a day?
Interesting. I've done a lot of manual work to set up a whole nginx layer to properly route stuff through one domain to various self-hosted services, with way too many hard lessons when I started this journey (from trying to do manual setup without Docker, to moving onto repeatable setups via Docker, etc.).
The setup appears very simple in Caddy - amazingly simple, honestly. I'm going to give it a good try.
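For anyone curious what "amazingly simple" means in practice: the whole reverse-proxy-plus-automatic-HTTPS setup is a few lines of Caddyfile. Hostnames and ports below are placeholders, not a recommendation:

```
# Caddyfile -- Caddy obtains and renews TLS certificates automatically
app.example.com {
    reverse_proxy localhost:3000
}

media.example.com {
    reverse_proxy localhost:8096
}
```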
Cloudflare explicitly supports customers placing insecure HTTP only sites behind a cloudflare HTTPS.
It's one of the more controversial parts of the business: it makes the fact that the traffic is unencrypted on public networks invisible to the end user.
1. DDoS protection is not the only thing anymore; I use Cloudflare because of the vast number of AI bots from thousands of ASNs around the world crawling my CI servers (bloated Java VMs on very undersized hosts) and bringing them down (granted, I threw Cloudflare onto my static sites as well, which was not really necessary; I just liked their analytics UX)
2. The XKCD comic is misinterpreted there: that little block is small because it's a "small open source project run by one person"; Cloudflare is the opposite of that
3. Edit: also, Cloudflare is awesome if you are migrating hosts. I did a migration this past month; you point Cloudflare to the new servers and it's instant DNS propagation (since you didn't propagate anything :) )
It’s that time of the year again where we all realize that relying on AWS and Cloudflare to this degree is pretty dangerous but then again it’s difficult to switch at this point.
If there is a slight positive note to all this, then it is that these outages are so large that customers usually seem to be quite understanding.
Unless you’re say at airport trying to file a luggage claim … or at the pharmacy trying to get your prescription. I think as a community we have a responsibility to do better than this.
I always see such negative responses when HN brings up software bloat ("why is your static site measured in megabytes").
Now that we have an abundance of compute and most people run devices more powerful than the devices that put man on the moon, it's easier than ever to make app bloat, especially when using a framework like Electron or React Native.
People take it personally when you say they write poor quality software, but it's not a personal attack, it's an observation of modern software practices.
And I'm guilty of this, mainly because I work for companies that prioritize speed of development over quality of software, and I suspect most developers are in this trap.
I think we have a new normal now though. Most web devs starting now don't know a world without React/Vue/Solid/whatever. Like, sure you can roll your own HTML site with JS for interactivity, but employers now don't seem to care about that; if you don't know React then don't bother.
You aren’t cloudflare’s customer in these examples. It depends on the companies that are actually paying for and using the service to complain. Odds are that they won’t care on your behalf due to how our society is structured.
Not really sure how our community is supposed to deal with this.
“We” are the ones making the architecture and the technical specs of these services. Taking care for it to still work when your favourite FAANGMC is down seems like something we can help with.
> If there is a slight positive note to all this, then it is that these outages are so large that customers usually seem to be quite understanding.
Which only shows that chasing five 9s is worthless for almost all web products. The idea is that by relying on AWS or Cloudflare you can push your uptime numbers up to that standard, but these companies themselves are having such frequent outages that customers themselves don't expect that kind of reliability from web products.
If I choose AWS/cloudflare and we're down with half of the internet, then I don't even need to explain it to my boss' bosses, because there will be an article in the mainstream media.
If I choose something else, we're down, and our competitors aren't, then my overlords will start asking a lot of questions.
Yup. AWS went down at a previous job and everyone basically took the day off and the company collectively chuckled. Cloudflare is interesting because most execs don’t know about it so I’d imagine they’d be less forgiving. “So what does cloudflare do for us exactly? Don’t we already have aws?”
Or _you_ aren't down, but a third-party you depend on is (auth0, payment gateway, what have you), and you invested a lot of time and effort into being reliable, but it was all for less than nothing, because your website loads but customers can't purchase, and they associate the problem with you, not with the AWS outage.
In reality it is not half of the internet. That is just marketing. I've personally noticed one news site while others were working. And I guess sites like that will get the blame.
Happy to hear anyone's suggestions about where else to go or what else to do in regards to protecting against large-scale volumetric DDoS attacks. Pretty much every CDN provider nowadays has stacked up enough capacity to tank these kinds of attacks; good luck trying to combat them yourself these days.
Somehow KiwiFarms figured it out with their own "KiwiFlare" DDOS mitigation. Unfortunately, all of the other Cloudflare-like services seem exceptionally shady, will be less reliable than Cloudflare, and probably share data with foreign intelligence services I have even less trust for than the ones Cloudflare possibly shares them with.
Unfortunately Anubis doesn't help where my pipe to the internet isn't fat enough to just eat up all the bandwidth that the attacker has available. Renting tens of terabits of capacity isn't cheap, and DDoS attacks nowadays are at that scale. BunnyCDN's DDoS protection is unfortunately too basic to filter out anything that's ever so slightly more sophisticated. Cloudflare's flexibility in terms of custom rulesets and their global pre-trained rulesets (based on attacks they've seen in the past) is imo just unbeatable at this time.
The Bunny Shield is quite similar to the Cloudflare setup. Maybe not 100% overlap of features but unless you’re Twitter or Facebook, it’s probably enough.
I think at the very least, one should plan the ability to switch to an alternative when your main choice fails… which together with AWS and GitHub is a weekly event now.
Why do people on a technical website suggest this? It's literally the same snake oil as Cloudflare. Both have an endgame of total web DRM; they want to make sure users "aren't bots". Each time the DRM is cracked, they will increase the complexity of the "verifier". You will be running arbitrary code in your big-4 browser to ensure you're running a certified big-4 browser, with 10 trillion man-hours of development, on a certified OS.
And if you do rule-based blocking, they just change their approach. I am constantly blocking big corps these days; normal bad actors take barely any work.
What do they even have a spider for? I never saw any actual traffic with Facebook as the source. I don't understand it either, but these are their official IPs and their official bot headers, and it behaves exactly like someone who wants my sites down.
Does it make sense? Nah, but is it part of the weird reality we live in? Looks like it.
I have no way of contacting Facebook. All I can do is keep complaining on Hacker News whenever the topic arises.
Edit: Oh, and I see the same with Azure; however, there I have no list of IPs to verify it's official, just that it looks like it.
Five 9s is about 5 minutes a year. They are breaking SLAs and impacting services people depend on.
Tbh though, this is sort of all the other companies' fault: "everyone" uses AWS and CF, and so others follow. Now not only are all your chicks in one basket, so are everyone else's. When the basket inevitably falls into a lake...
Providers need to be more aware of their global impact in outages, and customers need to be more diverse in their spread.
These kinds of outages continue to happen and continue to impact 50+% of the internet. Yes, they know they have that power, but they don't treat changes as such, so no, they aren't aware. Awareness would imply more care in operations like code changes and deployments.
Outages happen and code changes occur, but you can do a lot to prevent these things on a large scale, and they simply don't.
Where is the A/B deployment preventing a full outage? What about internally: where was the validation before the change? Was the testing run against a prod-like environment, or against something that once resembled prod but hasn't in forever?
They could absolutely mitigate impacting the entire global infra in multiple ways, and haven't, despite their many outages.
They are aware. They don't want to pay the cost benefit tradeoff. Education won't help - this is a very heavily argued tradeoff in every large software company.
I do think this is tenable as long as these services are reliable. Even though there have been some outages, I would argue that they're incredibly reliable at this point. If this ever changes, though, moving to a competitor won't be as simple as pushing a repository elsewhere, especially for AWS. I think that's where some of the potential danger lies.
> and judging by the HN post age, we're now past minute 60 of this incident.
Huh? It's been back up during most of this time. It was up and then briefly went back down again but it's been up for a while now. Total downtime was closer to 30 minutes
I'm already logged in on the cloudflare dashboard and trying to disable the CF proxy, but getting "404 | Either this page does not exist, or you do not have permission to access it" when trying to access the DNS configuration page.
Not saying not to do this to get through, but just as an observation, it’s also the sort of thing that can make these issues a nightmare to remediate, since the outage can actually draw more traffic just as things are warming up, from customers desperate to get through.
And I got a 504 error (served by CloudFront) on that status page earlier. The error message suggested there may have been a great increase in traffic that caused it.
Maybe that's precisely what Cloudflare did and now their status page is down because it's receiving an unusual amount of traffic that the VPS can't handle.
Could always just use a status page that updates itself. For my side project Total Real Returns [1], if you scroll down and look at the page footer, I have a live status/uptime widget [2] (just an <img> tag, no JS) which links to an externally-hosted status page [3]. Obviously not critical for a side project, but kind of neat, and was fun to build. :)
This is unrelated to the cloudflare incident but thanks a lot for making that page. I keep checking it from time to time and it's basically the main data source for my long term investing.
1- Has GCP also had any outages recently similar to AWS, Azure, or CF? If a similarly sized (14 TB?) DDoS were to hit GCP, would it stand or would it fail?
2- If this DDoS was targeting Fly.io, would it stand? :)
I actually spoke too soon, and accept I have egg on my face!
Apparently prisma's `npm exec prisma generate` command tries to download "engine binaries" from https://binaries.prisma.sh, which is behind... guess what...
So now my CI/CD is broken, while my production env is down, and I can't fix it.
When it's back up, do yourself a favour and rent a $5/mo VPS in another country from a provider like OVH or Hetzner and stick your status page on that.
"Yes, but what if they go down" - it doesn't matter; having it hosted by someone who can be down for the same reason as your main product/service is a recipe for disaster.
Definitely. Tangentially, I encountered 504 Gateway Timeout errors on cloudflarestatus.com about an hour ago. The error page also disclosed the fact that it's powered by CloudFront (Amazon's CDN).
I'd been using Cachet for quite a while before inevitably migrating to Atlassian's Statuspage.io. I'm a huge fan of self-hosting and self-managing every single thing in existence, but Cachet was just such a PITA to maintain and there was just no other good open source alternative to it.
Seems like Workers are less affected, and maybe Betterstack has decided to bypass Cloudflare "stuff" for the status pages (maybe to cut down on costs)? My site is still up, though some GitHub runners did show it failing at certain points.
Pretty sure they went down for a while, because I have 4xx errors they returned, but apparently it was short-lived. I wonder if their Workers infra failed for a moment and that led to a total collapse of all of their products?
I don't get why you need such a service for a status page with 99.whatever% uptime. I mean, your status page only has to be up if everything else is down, so maybe 1% uptime is fine.
There's something maliciously satisfying about seeing your own self-hosted stuff working while things behind Cloudflare or AWS are broken. Sure, they have like four more nines than me, but right now I'm sitting pretty.
My (s)crappy personal site was up during the AWS outage, the Azure outage, and now the Cloudflare outage. And I've only had it for 2 months! Maybe I can add a tracker somewhere, might be fun.
This is a real problem for some "old-school enterprise" companies that use Oracle, SAP, etc. along with the new AWS/CF-based services. They are all waiting around for new apps to come back up while their Oracle suite/SAP are still functioning. There is a lesson here for some of these new companies selling to old-school companies.
How do you deal with DNS? I'm hosting something on a Raspberry Pi at home, and I had recently moved the DNS to Cloudflare. It's quite funny seeing my small personal website being down, although quite satisfying seeing both the browser and host with a green tick while Cloudflare is down.
DNS is actually one of the easiest services to self-host, and it's fairly tolerant of downtime due to caching. If you want redundancy/geographical distribution, Hurricane Electric has a free secondary/slave DNS service [0] where they'll automatically mirror your primary/master DNS server.
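If your primary is plain BIND, letting a secondary service like HE's mirror the zone mostly comes down to a transfer/notify ACL. A sketch with a placeholder address; substitute the transfer server from HE's slave DNS instructions:

```
// named.conf on your primary -- sketch only
// 192.0.2.53 is a placeholder: use the address from HE's slave DNS docs.
zone "example.com" {
    type master;                            // "primary" on newer BIND
    file "/etc/bind/zones/db.example.com";
    allow-transfer { 192.0.2.53; };         // let the secondary AXFR the zone
    also-notify { 192.0.2.53; };            // push NOTIFY so it re-syncs quickly
    notify yes;
};
```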
I don't have experience with a dynDNS setup like you describe, hosting from (probably) home. But my domains are on a VPS (and a few other places here and there) and DNS is done via my domain reseller's DNS settings pages.
Never had an issue hosting my stuff, but as said - I don't yet have experience hosting something from home with a more dynamic DNS setup.
I was just able to save a proxied site. Then the dashboard went down again. I didn't even know it was still on. It's really not doing anything for performance because the traffic is quite low.
Is it me or has there been a very noticeable uptick in large scale infra-level outages lately? AWS, Cloudflare, etc have all been way under whatever SLA they publish.
Imagine vibe coding something in production, it breaks half the internet, then you can't vibe code it back because it broke the LLM providers. A real catch-22 for the modern age!
That does seem to be a coincidence, as the recent outages making headlines (including this one, according to early reports) have been associated with huge traffic spikes. It seems DDoS attacks are reaching a new level.
For me the only silver lining to all these cloud outages is that now we know their published SLA times mean absolutely nothing. The number of 9s used to at least give an indication of intent of reliability; now they are twisted to whatever metric the company wants to represent and don't actually represent guaranteed uptime anywhere.
Doesn’t everyone do that? I’ve never worked for a place that the base policy wasn’t credits. You might have special contract language stating otherwise, but for almost everyone, it’s credits.
Some of the other commenters here have posited a "vibe code theory". As the amount of vibe code in production increases, so does the number of bugs and, therefore, the number of outages.
None of the recent major outages were traced back to "vibe coding" or anything of the sort. They appear to be the kind of misconfigurations and networking fuckups that have existed since the Internet became more complex than 3 routers.
The "vibe thinking" trend where people stop using their brain and rely on whatever random output the LLM tells them is harder to diagnose, but it's certainly there and at least as bad as vibe coding.
What about the “vibe thinking” trend where people project their own narratives on to every situation, even if the information available shows that it’s a rise in large scale DDoS attacks?
> Some of the other commenters here have posited a "vibe code theory". As the amount of vibe code in production increases, so does the number of bugs and, therefore, the number of outages.
Likely this coupled with the mass brain damage caused by never-ending COVID re-infections.
Since vaccines don't prevent transmission, and each re-infection increases the chances of long COVID complications, the only real protection right now is wearing a proper respirator everywhere you go, and basically nobody is doing that anymore.
Most people are not self-reflective enough to notice. Need to trust the studies.
Far more plausible than the AI ideas.
I find it far more likely these are smart people running without oversight for years pre-COVID, relying on being smart at 2am change windows. Now half or a full std. dev. lower on the IQ scale, hubris means fewer guard rails before change, and far lower ability to recover during change window.
The theory I’ve heard is holiday deploy freezes coupled with Q4 goals creates pressure to get things in quickly and early. It’s all been in the last month or so which does line up.
This only amplifies the often-repeated propaganda about the "very powerful" enemies of democracy, who in fact are very fragile dictatorships. There's enough incompetence at tech companies to f up their own stuff.
If it's any guidance, US cyber risk insurance (which covers among other things disruptions due to supplier outages) has continuously dropped in price since Q1 2023, with a handful of percent per year.
Somewhere, at a floating desk behind a wall of lava lamps, in a nyancatified ghostty terminal with 32 different shader plugins installed:
You're absolutely right! I shouldn't have force pushed that change to master. Let me try and roll it back. *Confrobulating* Oh no! Cloudflare appears to be down and I cannot revert the change. Why don't you go make a cup of coffee until that comes back. This code is production ready, it's probably just a blip.
Even many non-tech people have begun to associate Internet-wide outages with "AWS must be down", so I imagine many of them searching "is aws down". For Downdetector, a hit is a down report, so it will report AWS impacts even when the culprit is Cloudflare, as in this case.
Interesting. Maybe "AWS is down" will become the new "the server is down" that some non-tech people throw around when anything unexpected happens on their computer?
How did we get to a place where either Cloudflare or AWS having an outage means a large part of the web going down? This centralization is very worrying.
Oddly this centralization allows a complete deferral of blame without you even doing anything: if you’re down, that’s bad. But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.
It also reduces your incentive to change, if “the internet is down” people will put down their device and do something else. Even if your web site is up they’ll assume it isn’t.
I’m not saying this is a good thing but I’m simply being realistic about why we ended up where we are.
As a user I do care, because I waste so much time on Cloudflare's "prove you are human" blocking-page (why do I have to prove it over and over again?), and frequently run on websites blocking me entirely based on some bad IP-blacklist used along with Cloudflare.
If you have a site with valuable content, the LLM crawlers hound you to no end. CF is basically a protection racket at this point for many sites. It doesn't even stop the more determined ones, but it keeps some away.
Oh, they're still botnets. We just look the other way because they're useful.
And they're pretty tame as far as computer fraud goes - if my device gets compromised I'd much rather deal with it being used for fake YouTube views than ransomware or a banking trojan.
You can make a little bit of cash on the side letting companies use your bandwidth a bit for proxying. You won’t even notice. $50/month. Times are tough!
Of course the risk here being whatever nefarious or illegal shit is flowing through your pipes, which you consented to and even received consideration for.
Absolutely. They have dramatically worsened the world, with little to no net positive impact. Nearly every positive impact (if not all of them) has an associated negative that dwarfs it.
LLMs aren't going anywhere, but the world would be a better place if they hadn't been developed. Even if they had more positive impacts, those would not outweigh the massive environmental degradation they are causing or the massive disincentive they created against researching other, more useful forms of AI.
IMO LLMs have been a net negative on society, including my life. But I'm merely pointing out the stark contrast on this website, and the fact that we can choose to live differently.
I am not anti-AI, nor unhappy about how any current LLM works. I'm unhappy about how AI is used and abused to collective detriment. LLM scraper spam leading to increased centralization and wider impacting failures is just one example.
Your position is similar to saying that medical drugs have been a net negative on society, because some drugs have been used and abused to collective detriment (and other negative effects, such as doctors prescribing pills instead of suggesting lifestyle changes). Does it mean that we would be better off without any medical drugs?
My position is that the negatives outweigh the positives, and I don't appreciate your straw man response. It's clear your question is not genuine and you're here to be contrarian.
A solid secondary option is making LLM scraping for training opt-in, and/or compensating sites that were/are scraped for training data. Hell, maybe then you wouldn't knock websites over, which is what incentivizes them to use Cloudflare in the first place.
But that means LLM researchers have to respect other people's IP which hasn't been high on their todo lists as yet.
bUt ThAT dOeSn'T sCaLe - not my fuckin problem chief. If you as an LLM developer are finding your IP banned or you as a web user are sick of doing "prove you're human" challenges, it isn't the website's fault. They're trying to control costs being arbitrarily put onto them by a disinterested 3rd party who feels entitled to their content, which it costs them money to deliver. Blame the asshole scraping sites left and right.
Edit: and you wouldn't even need to go THAT far. I scrape a whole bunch of sites for some tools I built and a homemade news aggregator. My IP has never been flagged because I keep the number of requests down wherever possible, and rate-limit them so it's more in line with human-like browsing. Like, so much of this could be solved with basic fucking courtesy.
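In case it's useful to anyone, the "basic courtesy" version really is just a few lines. A minimal Python sketch; the site, user agent, and delay are placeholders, and a real crawler should also honor Crawl-delay and back off on errors:

```python
import time
import urllib.request
import urllib.robotparser

BASE = "https://example.com"                       # placeholder target site
UA = "my-hobby-aggregator/1.0 (contact: me@example.com)"
DELAY = 5                                          # seconds between requests

robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()                                      # fetch and parse robots.txt

def polite_fetch(path: str) -> bytes | None:
    url = BASE + path
    if not robots.can_fetch(UA, url):              # respect Disallow rules
        return None
    req = urllib.request.Request(url, headers={"User-Agent": UA})
    with urllib.request.urlopen(req, timeout=30) as resp:
        body = resp.read()
    time.sleep(DELAY)                              # human-ish pacing, not a firehose
    return body

for page in ("/", "/news", "/about"):
    polite_fetch(page)
```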
Not to speak for the other poster, but... That's not a good-faith question.
Most of the problems on the internet in 2025 aren't because of one particular technology. They're because the modern web was based on gentleman's agreements and handshakes, and since those things have now gotten in the way of exponential profit increases on behalf of a few Stanford dropouts, they're being ignored writ large.
CF being down wouldn't be nearly as big of a deal if their service wasn't one of the main ways to protect against LLM crawlers that blatantly ignore robots.txt and other long-established means to control automated extraction of web content. But, well, it is one of the main ways.
Would it be one of the main ways to protect against LLM web scraping if we investigated one of the LLM startups for what is arguably a violation of the Computer Fraud and Abuse Act, arrested their C-suite, and sent each member to a medium-security federal prison (I don't know, maybe Leavenworth?) for multiple years after a fair trial?
I'm sure there will be an investigation... by the SEC, when the bubble pops and takes the S&P with it. No prison though; probably jobs at the next Ponzi scheme.
Unfortunately the problem isn't just "the internet sucks" it's "the internet sucks, and everyone uses it" - meaning people are not doing stuff offline, and a lot of our lives require us to be online.
That's a problem caused by bots and spammers and DDoSers, that Cloudflare is trying to alleviate.
And you generally don't have to prove it over and over again unless there's a high-risk signal associated with you, like you're using a VPN or have cookies disabled, etc. Which are great for protecting your privacy, but then obviously privacy means you do have to keep demonstrating you're not a bot.
You might say the problem CloudFlare is causing is lesser than the ones it's solving, but you can't say they're not causing a new, separate problem.
That they're trying counts for brownie points, it's not an excuse to be satisfied with something that still bothers a lot of people. Do better, CloudFlare.
"We have decided to endlessly punish you for using what few tools you have to avoid being exploited online, because it makes our multi-billion dollar business easier. Sucks to be you."
I just realized: why don't they have some "definitely human" third-party cookie that caches your humanness for 24h or so? I'm sure there's a reason (I've heard third-party cookies are less respected now), but can someone chime in on why this wouldn't work and save a ton of compute?
Yes, there are several, and the good one (linked below) lets you use the "humanness" token across different websites without them being able to use it as a tracking signal / supercookie. It's very clever.
Privacy through uniformity, operational security by routine, herd immunity for privacy, traffic normalization, "anonymity set expansion", "nothing to hide" paradox, etc.
I.e., if you use Tor for "normie sites", then the fact that someone can be seen using Tor is no longer a reliable proxy for detecting them trying to see/do something confidential and it becomes harder to identify & target journalists, etc. just because they're using Tor.
Tor Browser has ~1M daily users. Tons of people use it for hitting sites that may be blocked in their country, or because they want some privacy when viewing pregnancy- or health-related articles, etc.
In addition to the reasons in sibling comment, this also acts as a filter for low-quality ad-based sites; same reason I close just about any website that gives me a popup about a ToS agreement.
It is a trade-off between convenience and freedom. Netflix vs buying your movies. Spotify vs mp3s. Most tech products have alternatives. But you need to be flexible and adjust your expectations. Most people are not willing to do that
The issue is that real life is not adaptable. Resources and capital are slow.
That's the whole issue with monopolies for example, innit? We envision "ideal free market dynamics" yet in practice everybody just centralizes for efficiency gains.
Right, and my point is that "ideal free market dynamics" conveniently always ignore this failure state that seems to always emerge as a logical consequence of its tenets.
I don't have a better solution, but it's a clear problem. Also, for some reason, more and more people (not you) will praise and attack anyone who doesn't defend state A (ideal equilibrium). Leaving no room to point out state B as a logical consequence of A which requires intervention.
The definition of a monopoly basically resolves to "those companies that don't get pressured to meaningfully compete on price or quality", it's a tautology. If a firm has to compete, it doesn't remain a monopoly. What's the point you're making here?
There absolutely are options but we aren't using them because nobody cares enough about these downsides. bsky is up, with Mastodon you even have choice between tons of servers and setting up your own. Yet, nobody cares enough about the occasional outage to switch. It's such a minor inconvenience that it won't move the needle one bit. If people actually cared, businesses would lose customers and correct the issue.
More like it's time for the pendulum to swing back...
We had very decentralized "internet" with BBSes, AOL, Prodigy, etc.
Then we centralized on AOL (ask anyone over 40 if they remember "AOL Keyword: ACME" plastered all over roadside billboards).
Then we revolted and decentralized across MySpace, Digg, Facebook, Reddit, etc.
Then we centralized on Facebook.
We are in the midst of a second decentralization...
...from an information consumer's perspective. From an internet infrastructure perspective, the trend has been consistently toward more decentralization. Initially, even after everyone moved away from AOL as their sole information source online, they were still accessing all the other sites over their AOL dial-up connection. Eventually, competitors arrived and, since AOL no longer had a monopoly on content, they lost their grip on the infrastructure monopoly.
Later, moving up the stack, the re-centralization around Facebook (and Google) allowed those sources to centralize power in identity management. Today, though, people increasingly only authenticate to Facebook or Google in order to authenticate to some 3rd party site. Eventually, competitors for auth will arrive (or already have ahem passkeys coughcough) and, as no one goes to Facebook anymore anyway, they'll lose grip on identity management.
It's an ebb and flow, but the fundamental capability for decentralization has existed in the technology behind the internet from the beginning. Adoption and acclimatization, however, is a much slower process.
These centralized services do and did solve problems. I'm old enough to remember renting a quarter rack, racking my own server and other infrastructure, and managing all that. That option hasn't gone away, but there are layers of abstraction at work that many people probably haven't and don't want to be exposed to.
Aaand even if we ignore the "benefit" of Cloudflare and AWS outages being blamed on them rather than you, what does uptime look like for artisanally hosted services on a quarter rack vs your average service on AWS and Cloudflare?
> Businesses and peoples’ livelihoods are online nowadays
What happened to having a business continuity plan? E.g. when your IT system is down, writing down incoming orders manually and filling them into the system when it's restored?
I have a creeping suspicion that people don't care about that, in which case they can't really expect more than to occasionally be forced into some downtime by factors outside of their control.
Either it's important enough to have contingencies in place, or it's not. Downtime will happen either way, no matter how brilliant the engineers working at these large orgs are. It's just that with so much centralization (probably too much) the blast range of any one outage will be really large.
My wife and I own a small theatre. We can process orders in-store just fine. Our customers can even avoid online processing fees if they purchase in-store. And if our POS system went down, we could absolutely fall back to pencil and paper.
Doesn't change the fact that 99% of our ticket sales happen online. People will even come in to the theatre to check us out (we're magicians and it's a small magic shop + magic-themed theatre - so people are curious and we get a lot of foot traffic) but, despite being in the store, despite being able to buy tickets right then and there and despite the fact that it would cost less to do so ... they invariably take a flyer and scan the QR code and buy online.
We might be kind of niche, since events usually sell to groups of people and it's rare that someone decides to attend an event by themselves right there on the spot. So that undoubtedly explains why people behave like this - they're texting friends and trying to see who is interested in going. But I'm still bringing us up as an example to illustrate just how "online" people are these days. Being online allows you to take a step back, read the reviews, price shop, order later and have things delivered to your house once you've decided to commit to purchasing. That's just normal these days for so many businesses and their customers.
I’m not so sure about that. The pre-internet age had a lot of forced “mental health breaks”. Phone lines went down. Mail was delayed. Trains stalled. Businesses and livelihoods continued to thrive.
The idea that we absolutely need 24/7 productivity is a new one and I’m not that convinced by it. Obviously there are some scenarios that need constant connectivity but those are more about safety (we don’t want the traffic lights to stop working everywhere) than profit.
Just want to correct the record here, as someone who worked at a local CLEC where we took availability quite seriously before the age of the self-defeatist software engineer.
Phone lines absolutely did not go down. Physical POTS lines (yes, even the cheap residential ones) were required to have around five nines of availability, i.e. roughly five minutes of downtime per year. And that's for a physical medium affected by weather, natural disasters, accidents, and physical maintenance. If we or the LEC did not meet those targets, contracts would be breached and worst case the government would get involved.
Okay, as someone who also worked in that era I’ll be pedantic: internal phone systems went down. I experienced it multiple times so I certainly know it happened.
Most businesses are totally fine if they have a few hours of downtime. More uptime is better, but treating an outage like a disaster or an e-commerce site like a power plant is more about software engineer egos than business or customer needs.
If AWS is down, most businesses on AWS are also down, and it’s mostly fine for those businesses.
It's better to have diverse, imperfect infrastructure, than one form of infra that goes down with devastating results.
I'm being semi-flippant but people do need to cope with an internet that is less than 100% reliable. As the youth like to say, you need to touch grass
Being less flippant: an economy that is completely reliant on the internet is one vulnerable to cyberattacks, malware, catastrophic hardware loss
It also protects us from the malfeasance or incompetence of actors like Google (who are great stewards of internet infrastructure... until it's no longer in their interests)
I’ve worked in cloud consulting for a little over five years. I can say 95% of the time when I discuss the cost and complexity tradeoffs of their websites being down vs going multi region or god forbid “multi cloud”, they shrug and say, it will be fine if they are down for a couple of hours.
This was the same when I was doing consulting inside (ie large companies willing to pay the premium cost of AWS ProServe consultants) and outside working at 3rd party companies.
Wealthy, investment-bloated software companies will be fine.
Smaller companies that provide real world services or goods to make a much more meagre living that rely on some of the services sold to them by said software companies will be impacted much more greatly.
Losing a day or two of sales to someone who relies on making sales every day can be a growing hardship.
This doesn’t just impact developers. It’s exactly this kind of myopic thinking that leads to scenarios like mass outages.
> But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.
In my direct experience, this isn't true if you're running something even vaguely mission-critical for your customers. Your customer's workers just know that they can't do their job for the day, and your customer's management just knows that the solution they shepherded through their organization is failing.
It's really quite funny: many of the systems ACTUALLY vital to running the world as we know it are running on very different software. Cloudflare appears to have a much higher percentage of non-vital systems running on it than, say, something like Akamai.
If akamai went down i have a feeling you'd see a whole lot more real life chaos.
i also find the sentiment of "well we use a third party so blame them" completely baffling.
if you run anything even remotely mission critical, not having a plan B which is executable and of which you are in control (and a plan C) will make you look completely incompetent.
There are very, very few events which some people who run mission critical systems accept as force majeure. Most of those are of the scale "national emergency" or worse.
100% this. While in my professional capacity I'm all in for reliability and redundancy, as an individual I quite like these situations when it's obvious that I won't be getting any work done and it's out of my control, so I can go run some errands, read a book, or just finish early.
Which "user" are you referring to? Cloudflare users or end product users?
End product users have no power, they can complain to support and maybe get a free month of service, but the 0.1% of customers that do that aren't going to turn the tide and have anything change.
Engineering teams using these services also get "covered" by them - they can finger point and say "everyone else was down too."
This is essentially the entire IT excuse for going to anything cloud. I see IT engineers all the time justifying that the downtime stops being their problem and they stop being to blame for it. There's zero personal responsibility in trying to preserve service, because it isn't "their problem" anymore. Anyone who thinks the cloud makes service more reliable is absolutely kidding themselves, because everyone who made the decision to go that way already knows it isn't true, it just won't be their problem to fix it.
If anyone in the industry actually cared about reliability and took personal stake in their system being up, everyone would be back on-prem.
Reliability is not even how the cloud got sold to the C Suite. Good God, when my last company started putting things on Azure back in 2015 stuff would break weekly, usually on Monday mornings.
No, the value proposition was always about saving money, turning CapEx into OpEx. Direct quote from my former CEO maybe 9 years ago: We are getting out of the business of buying servers.
Cloud engineering involves architecting for unexpected events: retry patterns, availability zones, multi-region fail over, that sort of thing.
Now - does it all add up to cost savings? I could not tell you. I have seen some case studies, but I also have been around long enough to take those with a big grain of salt.
That might have been true for some kind of organization, but definitely not for every kind. On the other side, there were start-ups that wanted the elasticity and no commitments. But both sides at least partially liked the "it's not on me anymore" feature.
It's amazing how there are so many cybersecurity incidents now. Bypassing IT will always backfire spectacularly; IT are the people who stop you from doing something dumb.
The opposite was/is true. If your cloud box can only be used by two people and IT doesn't even know about it, then IT can never be persuaded to hand the keys to the rest of the company, as they were predisposed to do.
I saw this stuff too many times, and it is precisely why the cloud exploded in use in about 2010.
One notable example was signing keys for builds for distribution actually. And IT had a habit of handing them out to absolutely everyone. Being able to audit who did the signing was done in spite of IT who could, of course, never be persuaded of the merit of any process they don’t own.
I won't discount your IT can be bad, but also if you're keeping something as core to your security as signing keys somewhere your IT can't audit, you are just as bad. And your IT won't be the ones fired when your keys leak.
IMHO it adds, but only if you are big enough. Netflix level. At that level, you go and dine with Bezos and negotiate a massive discount. For anyone else, I’d genuinely love to see the numbers that prove otherwise.
> There's zero personal responsibility
Unfortunately, this seems to be the unspoken mantra of modern IT management. Nobody wants to be directly accountable for anything, yet everyone wants to have their fingerprints on everything. A paradox of collaboration without ownership.
Cloud providers have formalized these deals actually. If you promise to spend X amount over Y period, you get Z discounts.
And this is not reserved instances, this is an org level pricing deal. Some have been calling it anti-competitive and saying the regulators need to look at the practice.
> IMHO it adds, but only if you are big enough. Netflix level. At that level, you go and dine with Bezos and negotiate a massive discount. For anyone else, I’d genuinely love to see the numbers that prove otherwise.
It adds if you're smart about using resources efficiently, at any level. And engineer the system to spin up / spin down as customers dictate.
For situations where resources are allocated but are only being utilized a low percentage (even < 50% in some cases), it is not cost effective. All that compute / RAM / disk / network etc. is just sitting there wasted.
I mean in the end it's about making a trade off that makes sense for your business.
If the business can live with a couple of hours downtime per year when "cloud" is down, and they think they can ship faster / have less crew / (insert perceived benefit), then I don't know why that is a problem.
More like "don't have a choice". It's not like the customer is going to go to the competition, because before you finish switching, the service will be back.
Frankly it's a blessing, always being able to blame the cloud that management forced the company to migrate to in order to be "cheaper" (which half the time turns out to be false anyway).
> It also reduces your incentive to change, if “the internet is down” people will put down their device and do something else. Even if your web site is up they’ll assume it isn’t.
I agree. When people talk about the enshittification of the internet, Cloudflare plays a significant role.
Admittedly when I wrote that I was thinking about the recent AWS outage. Anecdotally, I asked friends and family about their experience and they assumed the internet was down. Almost everything at my work runs on Google cloud so we were still running but we observed a notable dip in traffic during the outage all the same.
> it is still bad
No doubt. But there’s a calculation to make, is it bad enough to spend the extra money on mitigations, to hire extra devops folks to manage it all… and in the majority of end user facing cases the answer is no, it isn’t.
Where I've worked and we've been in the cloud I've always promoted just running in one AZ, I run my own things in one Hetzner DC (hel1). I've done hybrid cloud as well and in that case we only have one AZ for the on-premise stuff anyways (plus offsite backup)
That one time when an AZ goes down and your infra successfully fails over to the other two isn't worth it for a lot of companies at my scale; ops consultants seem to be chasing high cloud spend to justify their own high cost. I also factor in that I live in Sweden, where most infrastructure outages are exceptionally rare.
Ofc it depends on what kind of company you are and what you're providing.
Eh? It's because they are offering a service too good to refuse.
The internet these days is fucking dangerous and murderous as hell. We need Cloudflare just to keep services up due to the deluge of AI data scrapers and other garbage.
Many reasons but DDoS protection has massive network effects. The more customers you have (and therefore bandwidth provision) the easier it is to hold up against a DDoS, as DDoS are targeting just one (usually) customer.
So there are massive economies of scale. A small CDN with (say) 10,000 customers and 10 Mbit/s provisioned per customer can handle a 100 Gbit/s DDoS (way too simplistic, but hopefully you get the idea) - way too small.
If you have the same traffic provisioned on average per customer and have 1 million customers, you can handle a DDoS 100x the size.
Only way to compete with this is to massively overprovision bandwidth per customer (which is expensive, as those customers won't pay more just for you to have more redundancy because you are smaller).
In a way (like many things in infrastructure) CDNs are natural monopolies. The bigger you get -> the more bandwidth and PoPs you can have -> more attractive to more customers (this repeats over and over).
It was probably very astute of Cloudflare to realise that offering such a generous free plan was a key step in this.
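To make the scaling argument above concrete, here's a back-of-the-envelope sketch; the 10 Mbit/s per-customer provisioning figure is the same illustrative assumption used in the comment, not a real number from any CDN:

    # Rough sketch of the economies-of-scale argument above.
    # Assumes ~10 Mbit/s of peak bandwidth provisioned per customer
    # (an illustrative figure, not any CDN's actual provisioning).
    def absorbable_ddos_gbps(customers, mbps_per_customer=10.0):
        # Aggregate provisioned bandwidth, i.e. the rough size of attack
        # the network can soak up when the attack targets a single customer.
        return customers * mbps_per_customer / 1000.0  # Mbit/s -> Gbit/s

    print(absorbable_ddos_gbps(10_000))     # small CDN:  ~100 Gbit/s
    print(absorbable_ddos_gbps(1_000_000))  # large CDN:  ~10,000 Gbit/s (100x)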
In a CDN, customers consume bandwidth; they do not contribute it. If Cloudflare adds 1 million free customers, they do not magically acquire 1 million extra pipes to the internet backbone. They acquire 1 million new liabilities that require more infrastructure investment.
All you are doing is echoing their pitch book. Of course they want to skim their share of the pie.
I imagine every single customer is provisioned based on some peak expected typical traffic and that's what they base their capital investment in bandwidth on.
However most customers are rarely at their peak, this gives you tremendous spare capacity to use to eat DDoS attacks, assuming that the attacks are uncorrelated. This gives you huge amounts of capacity that's frequently doing nothing. Cloudflare advertise this spare capacity as "DDoS protection."
I suppose in theory it might be possible to massively optimise utilisation of your links, but that would be at the cost of DDoS protection and might not improve your margin very meaningfully, especially if customers care a lot about being online.
> In a CDN, customers consume bandwidth; they do not contribute it
They contribute money which buys infrastructure.
> If Cloudflare adds 1 million free customers,
Is the free tier really customers? Regardless, most of them are so small that it doesn't cost Cloudflare much anyway. The infrastructure is already there. It's worth it to them for the goodwill it generates, which leads to future paying customers. It probably also gives them visibility into what is good vs bad traffic.
1 million small sites could very well cost less to cloudflare than 1 big site.
OP is saying it's cheaper overall for a 10 million customer company to add infrastructure for 1 million more than it is for a 10,000 customer company to add infrastructure for 1000 more people.
If you're looking at this as a "share of the pie", it's probably not going to make sense. The industry is not zero sum.
You aren't understanding economies of scale, and peak-to-average ratios.
The same reason I use cloud compute -- elastic infrastructure because I can't afford the peaks -- is the same reason large service providers "work".
It's funny how we always focus on Cloudflare, but all cloud providers have this same concentration downside. I think it's because Cloudflare loves to talk out of both sides of their mouth.
The "economies of scale" defense of Cloudflare ignores a fundamental reality: 23.8 million websites run on Cloudflare's free tier versus only 210,000 paying customers or so. Free users are not a strategic asset. They are an uncompensated cost, full stop. Cloudflare doesn't absorb this loss out of altruism; they monetize it by building AI bot-detection systems, charging for bot mitigation, and extracting threat intelligence data. Today's outage was caused by a bug in Cloudflare's service to combat bots.
That's AI bots, BTW. Bots like Playwright or Crawl4AI, which provide a useful service to individuals using agentic AI. Cloudflare is hostile to these types of users, even though they likely cost websites nothing to support well.
The "scale saves money" argument commits a critical error: it counts only the benefits of concentration while externally distributing the costs.
Yes, economies of scale exist. But Cloudflare's scale creates catastrophic systemic risk that individual companies using cloud compute never would. An estimated $5-15 billion was lost for every hour of the outage according to Tom's Guide. That cost didn't disappear. It was transferred to millions of websites, businesses, and users who had zero choice in the matter.
Again, corporations shitting on free users. It's a bad habit and a dark pattern.
Even worse, were you hoping to call an Uber this morning for your $5K vacation? Good luck.
This is worse than pure economic inefficiency. Cloudflare operates as an authorized man-in-the-middle to 20% of the internet, decrypting and inspecting traffic flows. When their systems fail, not due to attacks, but to internal bugs in their monetization systems, they don't just lose uptime.
They create a security vulnerability where encrypted connections briefly lose their encryption guarantee. They've done this before (Cloudbleed), and they'll do it again. Stop pretending to have rational arguments with irrational future outcomes.
The deeper problem: compute, storage, and networking are cheap. The "we need Cloudflare's scale for DDoS protection" argument is a circular justification for the very concentration that makes DDoS attractive in the first place. In a fragmented internet with 10 CDNs, a successful DDoS on one affects 10% of users. In a Cloudflare-dependent internet, a DDoS, or a bug, affects 50%, if Cloudflare is unable to mitigate (or DDoSs themselves).
Cloudflare has inserted themselves as an unremovable chokepoint. Their business model depends on staying that chokepoint. Their argument for why they must stay a chokepoint is self-reinforcing. And every outage proves the model is rotten.
hang on, you're reading some kind of cloudflare advocacy in my post. apologies if i implied that. i don't like to come off as a crank is all. IMO cloudflare is an evil that needs to be defeated. i'm just explaining how their business model "works" and why massive economy of scale matters, to support the GP poster.
i don't even think they are evil because of the concentration of power, that's just a problematic issue. the evil part is they convince themselves they aren't the bad guys. that they are saving us from ourselves. that the things they do are net positives, or even absolute positives. like the whole "let's defend the internet from AI crawlers" position they appointed themselves sheriff on, that i think you're referencing. it's an extremely dangerous position we've allowed them to occupy.
> they monetize it
yes, and they can't do this without the scale.
> scale saves money
any company, uber for example, can design their infra to not rely on a sole provider. but why? their customers aren't going to leave in droves when a pretty reliable provider has the occasional hiccup. so it's not worth the cost, so why shouldn't they externalize it? uber isn't in business to make the internet a better place. so yes, scale does save money. you're arguing something at a higher principle than how architectural decisions are made.
i'm not defending economy of scale as a necessary evil. i'm just backing up that it's how cloudflare is built, and that it is in fact useful to customers.
In my opinion, DDoS is possible only because there is no network protocol for a host to control traffic filtering on upstream providers (deny traffic from certain subnets or countries). If there were one, everybody would prefer to write their own systems rather than rely on a harmful monopoly.
The recent Azure DDoS used 500k botnet IPs. These will have been widely distributed across subnets and countries, so your blocking approach would not have been an effective mitigation.
Identifying and dynamically blocking the 500k offending IPs would certainly be possible technically -- 500k /32s is not a hard filtering problem -- but I seriously question the operational ability of internet providers to perform such granular blocking in real-time against dynamic targets.
I also have concerns that automated blocking protocols would be widely abused by bad actors who are able to engineer their way into the network at a carrier level (i.e. certain governments).
Is this really true? What device in the network are you loading that filter into? Is it even capable of handling the packet throughput of that many clients while also handling such a large block list?
But this is not one subnet. It is a large number of IPs distributed across a bunch of providers, and handled possibly by dozens if not hundreds of routers along the way. Each of these routers won't have trouble blocking a dozen or two IPs that would be currently involved in a DDoS attack.
But this would require a service like DNSBL / RBL which email providers use. Mutually trusting big players would exchange lists of IPs currently involved in DDoS attacks, and block them way downstream in their networks, a few hops from the originating machines. They could even notify the affected customers.
But this would require a lot of work to build, and a serious amount of care to operate correctly and efficiently. ISPs don't seem to have a monetary incentive to do that.
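For what it's worth, the email-world mechanism referenced above is simple enough to sketch. This is roughly how a DNSBL lookup works (zen.spamhaus.org is the well-known email blocklist, which may refuse queries routed through large public resolvers; a hypothetical DDoS blocklist operated by cooperating ISPs could be queried the same way):

    # Minimal sketch of a DNSBL/RBL query: reverse the IPv4 octets, prepend
    # them to the blocklist zone, and do an A-record lookup. An answer means
    # "listed"; no answer means "not listed". A hypothetical shared DDoS
    # blocklist could reuse exactly this mechanism.
    import socket

    def is_listed(ip, zone="zen.spamhaus.org"):
        query = ".".join(reversed(ip.split("."))) + "." + zone
        try:
            socket.gethostbyname(query)
            return True      # any A record back means the IP is listed
        except socket.gaierror:
            return False     # no answer: not listed

    print(is_listed("127.0.0.2"))  # standard DNSBL test address, expected to be listed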
It also completely overlooks the fact that some of the traffic has spoofed source IP addresses and a bad actor could use automated black holing to knock a legitimate site offline.
That already exists… that's part of Cloudflare and other vendors' mitigation strategy. There's absolutely no chance ISPs are going to extend that functionality to random individuals on the internet.
What traffic would you request the upstream providers to block if getting hit by Aisuru? Considering the botnet consists of residential routers, those are the same networks your users will be originating from. Sure, in best case, if your site is very regional, you can just block all traffic outside your country - but most services don't have this luxury.
Blocking individual IP addresses? Sure, but consider that before your service detects enough anomalous traffic from one particular IP and is able to send the request to block upstream, your service will already be down from the aggregate traffic. Even a "slow" ddos with <10 packets per second from one source is enough to saturate your 10Gbps link if the attacker has a million machines to originate traffic from.
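The arithmetic behind that "slow" DDoS claim checks out; a quick sketch, assuming roughly 1,200-byte packets (the packet size is my assumption, not the commenter's):

    # Sanity check for "10 packets/sec from a million machines saturates 10 Gbps".
    sources = 1_000_000
    pps_per_source = 10
    packet_bytes = 1_200           # assumed average packet size

    aggregate_gbps = sources * pps_per_source * packet_bytes * 8 / 1e9
    print(aggregate_gbps)          # ~96 Gbit/s, roughly 10x a 10 Gbit/s uplink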
In many cases the infected devices are in developing countries where none of your customers is. Many sites are regional, for example, a medium business operating within one country, or even city.
And even if the attack comes from your country, it is better to block part of the customers and figure out what to do next rather than have your site down.
Could it not be argued that ISPs should be forced to block users with vulnerable devices?
They have all the data on what CPE a user has, can send a letter and email with a deadline, and cut them off after it expires and the router has not been updated/is still exposed to the wide internet.
My dad’s small town ISP called him to say his household connection recently started saturating the link 24/7 and to look into whether a device had been compromised.
(Turns out some raspi reseller shipped a product with empty uname/password)
While a cute story, how do you scale that? And what about all the users that would be incapable of troubleshooting it, like if their laptop, roku, or smart lightbulb were compromised? They just lose internet?
And what about a botnet that doesn’t saturate your connection, how does your ISP even know? They get full access to your traffic for heuristics? What if it’s just one curl request per N seconds?
> While a cute story, how do you scale that? And what about all the users that would be incapable of troubleshooting it, like if their laptop, roku, or smart lightbulb were compromised? They just lose internet?
Uh, yes. Exactly and plainly that. We also go and suspend people's driver licenses or at the very least seriously fine them if they misbehave on the road, including driving around with unsafe cars.
Access to the Internet should be a privilege, not a right. Maybe the resulting anger from widespread crackdowns would be enough of a push for legislators to demand better security from device vendors.
> And what about a botnet that doesn’t saturate your connection, how does your ISP even know?
In ye olde days providers had (and were required to have) abuse@ mailboxes. Credible evidence of malicious behavior reported to these did lead to customers getting told to clean up shop or else.
Xfinity did exactly this to me a few years ago. I wasn't compromised but tried running a blockchain node on my machine. The connection to the whole house was blocked off until I stopped it.
> there is no network protocol for a host to control traffic filtering on upstream providers (deny traffic from certain subnets or countries).
There is no network protocol per se, but there are commercial solutions like Fortinet that can block countries, IIRC; note that it's only IP-range based, so it's not worth a lot.
Yeah, I went to HN after the third web page didn't work. I am not just worried about the single point of failure, I am much more worried about this centralization eventually shaping the future standards of the web and making it de facto impossible to self-host anything.
Well that and the fact that when 99% goes through a central party, then that central party will be very interesting for authoritarian governments to apply sweeping censorship rules to.
It is already nearly impossible/very expensive in my country to be able to get a public IP address (Even IPv6) which you could host on. World is heavily moving towards centrally dependant on these big Cloud providers.
What part of the world has any IPv6 limitations? In the USA, an ISP will give you a /48 from their /32 if you have any colo arrangement, without even a blink. That gives you 2^16 networks with an essentially infinite number of hosts on each network. Zero additional charge.
It is not as bad as Cloudflare or AWS because certificates will not expire the instant there is an outage, but consider that:
- It serves about 2/3 of all websites
- TLS is becoming more and more critical over time. If certificates fail, the web may as well be down
- Certificate lifetimes are becoming shorter and shorter: now 90 days, with Let's Encrypt moving toward 6-day certificates and 47 days planned as the industry-wide maximum
- An outage is one thing, but should a compromise happen, that would be even more catastrophic
Let's Encrypt is a good guy now, but remember that Google used to be a good guy in the 2000s too!
(Disclaimer: I am tech lead of Let's Encrypt software engineering)
I'm also concerned about LE being a single point of failure for the internet! I really wish there were other free and open CAs out there. Our goal is to encrypt the web, not to perpetuate ourselves.
That said, I'm not sure the line of reasoning here really holds up? There's a big difference between this three-hour outage and the multi-day outage that would be necessary to prevent certificate renewal, even with 6-day certs. And there's an even bigger difference between this sort of network disruption and the kind of compromise that would be necessary to take LE out permanently.
So while yes, I share your fear about the internet-wide impact of total Let's Encrypt collapse, I don't think that these situations are particularly analogous.
Agree, I’ve thought about this one too. The history of SSL/TLS certs is pretty hacky anyway in my opinion. The main problem they are solving really should have been solved at the network layer with ubiquitous IPsec and key distribution via DNS since most users just blindly trust whatever root CAs ship with their browser or OS, and the ecosystem has been full of implementation and operational issues.
Let’s Encrypt is great at making the existing system less painful, and there are a few alternatives like ZeroSSL, but all of this automation is basically a pile of workarounds on top of a fundamentally inappropriate design.
It's a shame DANE never took off.
If we actually got around to running a trusted DNSSEC based DNS system and allowed clients to create certificates thanks to DANE, we would be in a far more resilient setup compared to what we are now.
But DNSSEC was hard according to some, and now we are running a massive SPOF in terms of TLS certificates.
It didn't "not take off" --- it didn't work. You couldn't run it on the actual Internet with actual users, at least not without having a fallback path that attackers could trigger that meant DANE was really just yet another CA, only this one you can't detect misbehavior or kill it when it does misbehave.
There's not really a way around the initial trust problem with consumer oriented certs though. Yours could reduce the number of initially trusted down to one I think but not any further.
Mostly since the AWS craze started a decade ago, developers have gone away from Dedicated servers (which are actually cheaper, go figure), which is causing all this mess.
It's genuinely insane that many companies design a great number of fallbacks... at the software level, but almost no thought goes into the hardware/infrastructure level; common sense dictates that you should never host everything on a single provider.
I tried as hard as I could to stay self hosted (and my backend is, still), but getting constant DDoS attacks and not having the time to deal with fighting them 2-3x a month was what ultimately forced me to Cloudflare. It's still worse than before even with their layers of protection, and now I get to watch my site be down a while, with no ability to switch DNS to point back to my own proxy layer, since CF is down :/
This is wild. Was your website somehow controversial? I've been running many different websites for 30+ years now, and have never been the target of a DDoS. The closest I've seen was when one website had a blind time-based SQL injection vulnerability and the attacker was abusing it; all the SLEEP() injected into the database brought the server to a crawl. But that's just one attacker from a handful of IPs, hardly what I would call a DDoS.
I made the mistake of telling people it was hosted on a Pi cluster in a YouTube video a couple years ago, and asked that nobody try DDoSing it. I was a bit naive in thinking the YouTube viewer community was more like HN, where people may joke about it but nobody would actually do it.
I was wrong, and ever since I've dealt with a targeted attack (which was evolving as I added more CF firewall rules). At this point it's taken care of, but only because I have most things completely blocked at the CF firewall layer.
Until I changed job recently, I spent the past 8 years working in an area of tech that many people on places like HN and Reddit think that the work is a horrific waste of effort (DRM and content security for a streaming company).
The idea that if companies like my former employer stopped doing DRM, their audience would embrace it is pure idealism. Based on bitter experience, enough people will do bad things just for the lulz that you need to cover your ass.
My home lab will never have an open port, I'll always put things behind a CDN or zero trust system, even then...
FWIW, it's worthwhile just for educational reasons to look at abuseipdb.com - quite revealing.
Jeff, I think the reason is that the YouTube community is more mainstream. I'd consider you a really nice YouTuber, but even that can attract some bad-faith actors, just because of how mainstream YouTube is compared to HN, which is more niche overall.
(Also, congrats on 1 million subscribers - I know you must be tired of hearing it, but have a nice day Jeff! Your videos are awesome!)
When I was younger and living in military dorms, I put an old throwaway laptop hosting a simple website via Apache on the internet. Every time I checked the log it'd be full of so many random, wild spurts of attacks (granted, I had basically 0 legit traffic).
I think people sometimes mistake legitimate traffic spikes for DDOS attacks. My blog has the former, but no site I have ever hosted has seen the latter.
With the state of constant attack from AI scrapers and DDOS bots, you pretty much need to have a CDN from someone now, if you have a serious business service. The poor guys with single prem boxes with static HTML can /maybe/ weather some of this storm alone but not everything.
This is the sad reality behind it. My websites would be constantly down because of AI scrapers. If anyone knows a good alternative, that doesn't cost an arm and a leg I am very open to hear!
I self hosted on one of the company’s servers back in the late 90s. Hard drive crashes (and a hack once, through an Apache bug) had our services (http, pop, smtp, nfs, smb, etc ) down for at least 2-3 days (full reinstall, reconfiguration, etc).
Then, with regular VPSs I also had systems down for 1-2 days. Just last week the company that hosts NextCloud for us was down the whole weekend (from Friday evening) and we couldn’t get their attention until Monday.
So far these huge outages that last 2-5 hours are still lower impact for me, and require me to take less action.
I like the idea of having my own rack in a data center somewhere (or sharing the rack, whatever) but even a tiny cost is still more than free. And even then, that data center will also have outages, with none of the benefits of a Cloudflare Pages, GitHub Pages, etc.
> developers have gone away from Dedicated servers (which are actually cheaper, go figure)
It depends on how you calculate your cost. If you only include the physical infrastructure, a dedicated server is cheaper. But with a dedicated server you lose a lot of flexibility. Need more resources? Just scale up your EC2 instance; with a dedicated server there is a lot more work involved.
Do you want a 'production-ready' database? With AWS you can just click a few buttons and have an RDS instance ready to use. To roll out your own PG installation you need someone with a lot of knowledge (how to configure replication? backups? updates? ...).
So if you include salaries in the calculation the result changes a lot. And even if you already have some experts on your payroll, by putting them to work deploying a PG instance you won't be able to use them to build other things that may generate more value to your business than the premium you pay to AWS.
Cloud hosters are that hardware fallback. They started out offering better redundancy and scaling than your homemade breadbox. But it seems they lost something along the way, and now we have this.
Maintenance cost is the main issue for on-prem infra; nowadays add things like DDoS protection and/or scraping protection, which can require a dedicated team or force your company to rely on some library or open source project that is not guaranteed to be maintained forever (unless you give them support, which I believe in)... Yeah, I can understand why companies shift off of on-prem nowadays.
... dedis are cheaper if you are rightsized. If you are wrongsized they just plain crash, and you may or may not be able to afford the upgrade.
I was at Softlayer before I was at AWS, and what catalyzed the move was the time I needed to add another hard drive to a system and somehow they screwed it up. I couldn't put in a trouble ticket to get it fixed because my database record in their trouble ticket system was corrupted. The next day I moved my stuff to AWS, and the day after that they had a top sales guy talk to me to try to get me to stay, but it was too late.
This might sound crazy as a software engineer, but I actually like the occasional "snow day" where everything goes down. It's healthy for us to all disconnect from the internet for a bit. The centralization unintentionally helps facilitate that. At least, that's my glass half full perspective.
I can understand that sentiment. Just don't lose sight of the impact it can have on every day people. My wife and I own a small theatre and we sell tickets through Eventbrite. It's not my full time job but it is hers. Eventbrite sent out an email this morning letting us know that they are impacted by the outage. Our event page appears to be working but I do wonder if it's impacting ticket sales for this weekend's shows.
So while us in tech might like a "snow day", there are millions of small businesses and people trying to go about their day to day lives who get cut off because of someone else's fuck-ups when this happens.
Absolutely solid point; there are a couple of apps I use daily for productivity, chores, even alarm scheduling, where on the free versions the ads wouldn't load so I couldn't use them (though some of them have been updated already). Made me realize we're kind of like cyborgs, relying on technology that's integrated so deeply into our lives that all it takes is an EMP blast (like a monopolistic service going down) to bring -us- down until we take a breath and learn how to walk again. Wild time.
> This might sound crazy as a software engineer, but I actually like the occasional "snow day" where everything goes down
As a software engineer, I get it. As a CTO, I spent this morning triaging with my devops AI (actual Indian) to find some workaround (we found one), while our CEO was doing damage control with customers (non-technical field) who were angry that we were down and they were losing business by the minute.
sometimes I miss not having a direct stake in the success of the business.
I'm guessing you're employed and your salary is guaranteed regardless. Would you have the same outlook if you were the self-employed founder of an online business and every minute of outage was costing you money?
If you're an event organizer whose big event is in two days, for example, then every minute your website's down translates to people not paying to attend your paid event. Bonus points because, as event managers know, people often wait until 2 days before the event to finally commit. Bonus points if you knew this and therefore ran a costly email campaign just before the outage, a campaign that is now sitting at a near-0% click rate.
For businesses whose profit margins are already slim, which is most traditional businesses trading online, making less money than they usually would will put them into the red, and even for those that are still in profit, making less money than you usually would means you have less money to pay the expenses that you usually do, expenses that are predicated on you making a certain amount of revenue.
You're living in a bubble. I know enough people who live paycheck to paycheck and always have exactly $0 in their pocket before the end of the month. It's pretty normal in some parts of the world, maybe even most of them.
That's a weirdly flippant response to what's a serious issue, but I'll give it the courtesy of a reply anyway - maybe not, but a business not making enough profit might go under, or they might only have to fire someone to prevent that from happening.
Technically, multi-node cluster with failover (or full on active-active) will have far higher uptime than just a single node.
Practically, getting the multi-node cluster (for any non-trivial workload) to work right, reliably, and fail over in every case is far more work and far more code (which can have more bugs), and even if you do everything right and test what you can, unexpected stuff can still kill it. Like recently we had an uncorrectable memory error which happened to hit the Ceph daemon just right, so that one of the OSDs misbehaved and bogged down the entire cluster...
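For the theoretical side of that tradeoff, a toy calculation (the 99.9% per-node figure is illustrative, and the independence assumption is exactly the part that, as the comment notes, tends not to hold once failover machinery and shared components enter the picture):

    # Toy availability math: single node vs. an idealized two-node failover pair.
    # Assumes 99.9% availability per node and fully independent failures,
    # with the failover mechanism itself assumed to be perfect.
    single = 0.999
    cluster = 1 - (1 - single) ** 2

    print((1 - single) * 365 * 24)        # single node: ~8.8 hours down per year
    print((1 - cluster) * 365 * 24 * 60)  # ideal pair:  ~0.5 minutes down per year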
You jest, but this actually does exist. Multiple CDNs sell multi-CDN load balancing (divide traffic between 2+ CDNs per variously-complicated specifications, with failover) as a value add feature, and IIRC there is at least one company for which this is the marquee feature. It's also relatively doable in-house as these things go.
As someone who has worked for a CDN for over a decade, this is what most big customers do. Under normal circumstances, they send portions of traffic to different CDNs, usually based on cost (and or performance in various regions). When an issue happens, they will pull traffic from the problem CDN.
Of course, if a big incident happens for a big CDN, there might not be enough latent capacity in the other CDNs to take all the traffic. CDNs are a cutthroat business, with small margins, so there usually isn’t a TON of unused capacity laying around.
It costs a lot of money to move, you don't know if the alternative will be any better, and if it affects a lot of companies then it's nobody's fault. "Nobody ever got fired for buying Cloudflare/AWS" as they say.
it's not just that, it's the creation of a sorta status symbol, or at least of symbol of normality.
there was a point (maybe still) where not having a netflix subscription was seen as 'strange'.
if that's the case in your social circles -- and these kind of social things bother you -- you're not going to cancel the subscription due to bad service until it becomes a socially accepted norm.
It's just that customers are more understanding when they see their Netflix not working either; otherwise they just think you're less professional. Try talking to customers after an outage and you will see.
except, yknow, where people's lives and livelihoods depend on access to information or being able to do things at an exact time. aws and cloudflare are disqualifying themselves from hospitals and military and whatnot.
For example, Cloudflare employees make money on promises to mitigate such attacks, but then can’t guarantee they will, and take all their customers down at once. It’s a shared pain model.
How did we get to a place where Cloudflare being down means we see an outage page, but on that page it tells us explicitly that the host we're trying to connect to is up, and it's just a Cloudflare problem.
If it can tell us that the host is up, surely it can just bypass itself to route traffic.
It's not only centralization in the sense that your website will be down if they are down; it is also a centralized MITM proxy. If you transfer sensitive data like chats over Cloudflare-"protected" endpoints, you also allow CF to transparently read and analyze it in plain text. It must be very easy for state agencies to spy on the internet nowadays; they would just ask CF to redirect traffic to them.
Because it's better to have a really convenient and cheap service that works 99% of the time than a resilient one that is more expensive or more cumbersome to use.
It's like github vs whatever else you can do with git that is truly decentralized. The centralization has such massive benefits that I'm very happy to pay the price of "when it's down I can't work".
Most developers don't care to know how the underlying infrastructure works (or why) and so they take whatever the public consensus is re: infra as a statement of fact (for the better part of the last 15 years or so that was "just use the cloud"). A shocking amount of technical decisions are socially, not technically enforced.
This topic is raised every time there is an outage with Cloudflare, and the truth of the matter is, they offer an incredible service and there is no competition big enough to challenge them. By definition their services are so good BECAUSE their adoption rate is so high.
It's very frustrating of course, and it's the nature of the beast.
Because DDoS is a fact of life (and even if you aren't targeted by DDoS, the bot traffic probing you to see if you can be made part of the botnet is enough to take down a cheap $5 VPS). So we have to ask - why? Personally, I don't accept the hand-wavy explanation that botnets are "just a bunch of hacked IoT devices". No, your smart lightbulb isn't taking down Reddit. I slightly believe the secondary explanation that it's a bunch of hacked home routers. We know that home routers are full of things like suspicious oopsie definitely-not-government backdoors.
IMO, centralization is inevitable because the fundamental forces drive things in that direction. Clouds are useful for a variety of reasons (technical, time to market, economic), so developers want to use them. But clouds are expensive to build and operate, so there are only a few organizations with the budget and competency to do it well. So, as the market matures you end up with 3 to 5 major cloud operators per region, with another handful of smaller specialists. And that’s just the way it works. Fighting against that is to completely swim upstream with every market force in opposition.
Compliance. If you wanna sell your SAAS to big corpo, their compliance teams will feel you know what you're doing if they read AWS or Cloudflare on your architecture, even if you do not quite know what you're doing.
I would be less worried if Cloudflare and AWS weren't involved in many more things than simply running DNS.
AWS - someone touches DynamoDB and it kills the DNS.
Cloudflare - someone touches functionality completely unrelated to DNS hosting and proxying and, naturally, it kills the DNS.
There is this critical infrastructure that just becomes one small part of a wider product offering, worked on by many hands, and this critical infrastructure gets taken down by what is essentially a side-effect.
It's a strong argument to move to providers that just do one thing and do it well.
It's weird to think about so bear with me. I don't mean this sardonically or misanthropically. But, it's "just the internet." It's just the internet. It doesn't REALLY matter in a large enough macro view. It's JUST the internet.
Well the centralisation without rapid recovery and practices that provide substantial resiliency… that would be worrying.
But I dare say the folks at these organisations take these matters incredibly seriously and the centralisation problem is largely one of risk efficiency.
I think there is no excuse, however, not to have multi-region state and pilot-light architectures in place, just in case.
This was always the case. There was always a "us-east" in some capacity, under Equinix, etc. Except it used to be the only "zone," which is why the internet is still so brittle despite having multiple zones. People need to build out support for different zones. Old habits die hard, I guess.
A lot (and I mean a lot) of people in IT like centralization specifically because it’s hard to blame people for doing something that everyone else is doing.
> How did we get to a place where either Cloudflare or AWS having an outage means a large part of the web going down?
As always, in the name of "security". When are we going to learn that anything done, either by the government or by a corporation, in the name of security is always bad for the average person?
For most services it's safer to host from behind Cloudflare, and Cloudflare is considered more highly available than a single IaaS or PaaS, at least in my headcanon.
It's because single points of traffic concentration are the most surveillable architecture, so FVEY et al economically reward with one hand those companies who would build the architecture they want to surveil with the other hand.
Currently at the public library and I can't use the customer inventory terminals to search for books. They're just a web browser interface to the public facing website, and it's hosted behind CF. Bananas.
Agreed. More worrying is that the standard practice of separating domain and nameserver administration appears to have been lost to one-stop-shop marketing.
Short-term economic forces, probably. Centralization is often cheaper in the near term. The cost of designing in single-point failure modes gets paid later.
Don't forget the CrowdStrike outage: one company had a bug that brought down almost everything. Who would have thought there are so many single points of failure across the entire Internet.
The same reason we have centralization across the economy. Economies of scale are how you make a big business successful, and once you are on top it's hard to dislodge you.
And all of these outages happening not long after most of them dismissed a large amount of experienced staff while moving jobs offshore to save in labor costs.
I think some of the issues in the last outage actually affected multiple regions. IIRC internally some critical infrastructure for AWS depends on us-east-1 or at least it failed in a way that didn't allow failover.
We take the idea of the internet always being on for granted. Most people don’t understand the stack and assume that when sites go down it’s isolated, and although I agree with you, it’s just as much complacency and lack of oversight and enforcement delays in bureaucracy as it is centralization. But I guess that’s kind of the umbrella to those things… lol
People use CloudFlare because it's a "free" way for most sites to not get exploited (WAF) or DDoSed (CDN/proxy) regularly. A DDoS can cost quite a bit more than a day of downtime, even just a thundering herd of legitimate users can explode an egress bill.
It sucks there's not more competition in this space but CloudFlare isn't widely used for no reason.
AWS also solves real problems people have. Maintaining infrastructure is expensive as is hardware service and maintenance. Redundancy is even harder and more expensive. You can run a fairly inexpensive and performant system on AWS for years for the cost of a single co-located server.
There is this tendency to phrase questions (or statements) as
"when did 'we' ".
These decisions are made individually, not centrally. There is no process in place (and most likely there never will be) that can control or dictate things if people decide one way of doing them is the best way, even assuming they understand everything or know of the pitfalls.
Even if you can control individually what you do for the site you operate (or are involved in) you won't have any control on parts of your site (or business) that you rely on where others use AWS or Cloudflare.
Re: Cloudflare it is because developers actively pushed "just use Cloudflare" again and again and again.
It has been dead to me since the SSL cache vulnerability thing and the arrogance with which senior people expected others to solve their problems.
But consider how many people still do stupid things like use the default CDN offered by some third party library, or use google fonts directly; people are lazy and don't care.
It's not really. People are just very bad at putting the things around them into perspective.
Your power is provided by a power utility company. They usually serve an entire state, if not more than one (there are smaller ones too). That's "centralization" in that it's one company, and if they "go down", so do a lot of businesses. But actually it's not "centralized", in that 1) there are actually many different companies across the country/world, and 2) each company "decentralizes" most of its infrastructure to prevent massive outages.
And yes, power utilities have outages. But usually they are limited in scope and short-lived. They're so limited that most people don't notice when they happen, unless it's a giant weather system. Then if it's a (rare) large enough impact, people will say "we need to reform the power grid!". But later when they've calmed down, they realize that would be difficult to do without making things worse, and this event isn't common.
Large internet service providers like AWS, Cloudflare, etc, are basically internet utilities. Yes they are large, like power utilities. Yes they have outages, like power utilities. But the fact that a lot of the country uses them, isn't any worse than a lot of the country using a particular power company. And unlike the power companies, we're not really that dependent on internet service providers. You can't really change your power company; you can change an internet service provider.
Power didn't use to be as reliable as it is. Everything we have is incredibly new and modern. And as time has passed, we have learned how to deal with failures. Safety and reliability have increased throughout critical industries as we have learned to adapt to failures. But that doesn't mean there won't be failures, or that we can avoid them all.
We also have the freedom to architect our technology to work around outages. All the outages you have heard about recently could be worked around, if the people who built on them had tried:
- CDN goes down? Most people don't absolutely need a CDN. Point your DNS at your origins until the CDN comes back; see the sketch after this list. (And obviously, your DNS provider shouldn't be the same as your CDN...)
- The control plane goes down on dynamic cloud APIs? Enable a "limp mode" that persists existing infrastructure to serve your core needs. You should be able to service most (if not all) of your business needs without constantly calling a control plane.
- An AZ or region goes down? Use your disaster recovery plan: deploy infrastructure-as-code into another region or AZ. Destroy it when the az/region comes back.
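As a sketch of the first workaround (pointing DNS back at the origin): the health-check URL, the IP, and the update_dns_record helper below are placeholders, and a real setup would call its DNS provider's API and keep TTLs low ahead of time (with that DNS provider not being the CDN that's down):

    # Hedged sketch of "repoint DNS at the origin when the CDN is down".
    # update_dns_record() stands in for whatever API your DNS provider exposes;
    # URLs, IPs, and thresholds are illustrative only.
    import urllib.request

    CDN_CHECK_URL = "https://www.example.com/healthz"   # served through the CDN
    ORIGIN_IP = "203.0.113.10"                          # documentation-range IP

    def cdn_is_healthy(url, attempts=3):
        # Treat the CDN as down only after several consecutive failures.
        for _ in range(attempts):
            try:
                urllib.request.urlopen(url, timeout=5)
                return True          # any successful HTTP response counts
            except Exception:
                pass
        return False

    def update_dns_record(name, ip):
        # Placeholder: call your DNS provider's API here.
        print("would repoint", name, "->", ip)

    if not cdn_is_healthy(CDN_CHECK_URL):
        update_dns_record("www.example.com", ORIGIN_IP)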
...and all of that just to avoid a few hours of downtime per year? It's likely cheaper to just take the downtime. But that doesn't stop people from piling on when things go wrong, questioning whether the existence of a utility is a good idea.
5 mins. of thought to figure out why these services exist?
Dialogue about mitigations/solutions? Alternative services? High availability strategies?
Nah! It's free to complain.
Me personally, I'd say those companies do a phenomenal job by being a de facto backbone of the modern web. Also Cloudflare, in particular, gives me a lot of things for free.
I have Cloudflare running in production and it is affecting us right now. But at least I know what is going on and how I can mitigate (e.g. disable Cloudflare as a proxy if it keeps affecting our services at skeeled).
Interestingly, I'm also noticing that websites using Cloudflare Challenge (aka "I'm not a Robot") are throwing exceptions with a message like "Please unblock challenges.cloudflare.com to proceed" - even though it's just responding with an HTTP 500.
The state of error handling in general is woeful, they do anything to avoid admitting they're at fault so the negative screenshots don't end up on social media.
Blame the user or just leave them at an infinite spinning circle of death.
I check the network tab and find the backend is actually returning a reasonable error but the frontend just hides it.
Most recent one was a form saying my email was already in use, when the actual backend error returned was that the password was too long.
I think the site (front-end) thinks you have blocked the domain through DNS or an extension; and thus suggests you unblock it. It is unthinkable that Cloudflare captchas could go down /s.
I’d rather mitigate a DDoS attack on my own servers than deal with Cloudflare. Having to prove you’re human is the second-worst thing on my list, right after accepting cookies. Those two things alone have made browsing the web a worse experience than it was in the late 90s or early 2000s.
There's something worse than having to prove (over and over and over again) that you are human: having your IP completely blocked by Cloudflare's zealous bot-filtering (and I use a plain mass-market ISP in a developed country, not some shady network).
Alright kids, breathe...a DDoS attack isn't the end of the world, it's just the internet throwing a tantrum. If you really don't want to use a fancy protection provider, you can still act like a grown-up: get your datacenter to filter trash at the edge, announce a more specific prefix with BGP so you can shift traffic, drop junk with strict ACLs, and turn on basic rate limiting so bots get bored. You can also tune your kernel so it doesn't faint at SYN storms, and if the firehose gets too big, pop out a more specific BGP prefix from a backup path or secondary router so you can pull production away from the burning IP.
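On the "basic rate limiting" point, the core idea is a token bucket per client; here's a toy application-level version (the numbers are arbitrary, and in practice you'd usually do this in the kernel, the load balancer, or the reverse proxy rather than in app code):

    # Toy per-IP token-bucket rate limiter illustrating "basic rate limiting".
    # Capacity and refill rate are arbitrary example values.
    import time
    from collections import defaultdict

    CAPACITY = 20          # allowed burst size per IP
    REFILL_PER_SEC = 5     # sustained requests per second per IP

    _buckets = defaultdict(lambda: {"tokens": CAPACITY, "ts": time.monotonic()})

    def allow(ip):
        b = _buckets[ip]
        now = time.monotonic()
        b["tokens"] = min(CAPACITY, b["tokens"] + (now - b["ts"]) * REFILL_PER_SEC)
        b["ts"] = now
        if b["tokens"] >= 1:
            b["tokens"] -= 1
            return True
        return False       # caller drops or tarpits the request (e.g. HTTP 429)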
> pop out a more specific BGP prefix from a backup path or secondary router so you can pull production away from the burning IP.
This won't help against carpet bombing.
The only workable solution for enterprises is a combination of on-prem and cloud mitigation. Cloud to get all the big swaths of mitigation and to keep your pipe flowing, and on-prem to mitigate specific attack vectors like state exhaustion.
Very quickly you'll find this doesn't work. Your DC will just null your IP. You'll switch to a new one and the attackers will too, the DC will null that one. You won't win at this game unless you're a very sizeable organization or are just willing to wait the attackers out, they will get bored eventually.
Worrying about a DDoS on your tiny setup is like a brand-new dev stressing over how they'll handle a billion requests per second...cute, but not exactly a real-world problem for 99.99% of you. It's one of those internet boogeyman myths people love to panic about.
As much as this situation sucks, how do you plan to "mitigate a DDoS attack on my own servers"? The reason I use Cloudflare is to use it as a proxy, especially for DDoS attacks if they do occur. Right now, our services are down and we are getting tons of customer support tickets (like everyone else), but it is a lot easier to explain that the whole world is down vs. that it's just us.
> During our attempts to remediate, we have disabled WARP [their VPN service] access in London. Users in London trying to access the Internet via WARP will see a failure to connect.
Posted 4 minutes ago. Nov 18, 2025 - 13:04 UTC
> We have made changes that have allowed Cloudflare Access [their 'zero-trust network access solution'] and WARP to recover. Error levels for Access and WARP users have returned to pre-incident rates.
> We have re-enabled WARP access in London.
> We are continuing to work towards restoring other services.
> Posted 12 minutes ago. Nov 18, 2025 - 13:13 UTC
Now I'm really suspicious that they were attacked...
Someone running cloudflared accidentally advertising a critical route into their WARP namespace and somehow disrupting routes for internal Cloudflare services doesn't seem too far-fetched.
>A spokesperson for Cloudflare said: “We saw a spike in unusual traffic to one of Cloudflare’s services beginning at 11.20am. That caused some traffic passing through Cloudflare’s network to experience errors. While most traffic for most services continued to flow as normal, there were elevated errors across multiple Cloudflare services.
>“We do not yet know the cause of the spike in unusual traffic. We are all hands on deck to make sure all traffic is served without errors. After that, we will turn our attention to investigating the cause of the unusual spike in traffic.”
"Unusual spike of traffic" can just be errant misconfiguration that causes traffic spikes just from TCP retries or the like. Jumping to "cyber attack" is eating up Hollywood drama.
In most cases, it's just cloud services eating shit from a bug.
I’ve written before on HN about when my employer hired several ex-FAANG people to manage all things cloud in our company.
Whenever there was an outage they would put up a fight against anyone wanting to update the status page to show the outage. They had so many excuses and reasons not to.
Eventually we figured out that they were planning to use the uptime figures for requesting raises and promos as they did at their FAANG employer, so anything that reduced that uptime number was to be avoided at all costs.
Are there companies that actually use their statuspage as a source of truth for uptime numbers?
I think it's way more common for companies to have a public status page, and then internal tooling that tracks the "real" uptime number. (E.g. Datadog monitors, New Relic monitoring, etc)
I don’t know, but I will say that this team that was hired into our company was so hyperfocused on any numbers they planned to use for performance reviews that it probably didn’t matter which service you chose to measure the website performance. They’d find a way to game it. If we had used the internal devops observability tools I bet they would have started pulling back logging and reducing severity levels as reported in the codebase.
It’s obviously not a problem at every company because there are many companies who will recognize these shenanigans and come down hard on them. However you could tell these guys could recognize any opportunity to game the numbers if they thought those numbers would come up at performance review time.
Ironically our CEO didn’t even look at those numbers. He used the site and remembered the recent outages.
It's because if you automate it, something could/would happen to the little script that defines "uptime," and if that goes down, suddenly you're in violation of your SLA and all of your customers start demanding refunds/credits/etc. when everything is running fine.
Or let's say your load balancer croaks, triggering a "down" status, but it's 3am, so a single server is handling traffic just fine? In short, defining "down" in an automated way is just exposing internal tooling unnecessarily and generates more false positives than negatives.
Lastly, if you are allowed 45 minutes of downtime per year and it takes you an hour to manually update the status page, you just bought yourself an extra hour to figure out how to fix the problem before you have to start issuing refunds/credits.
I found GitHub's old "how many visits to this status page have there been recently" graph on their status page to be an absurdly neat solution to this.
Requires zero insight into other infrastructure, absolutely minimal automation, but immediately gives you an idea whether it's down for just you or everybody. Sadly now deceased.
I like that https://discordstatus.com/ shows the API response times as well. There's times where Discord will seem to have issues, and those correlate very well with increased API response times usually.
Reddit Status used to show API response times way back in the day as well when I used to use the site, but they've really watered it down since then. Everything that goes there needs to be manually put in now AFAIK. Not to mention that one of the few sections is for "ads.reddit.com", classic.
They are manual AND political (depending on how big the company is). Because having a dashboard go to red usually has a bunch of project work behind it.
Yeah, this is something people think is super easy to automate, and it is for the most basic implementation of something like a single test runner. The most basic implementation is prone to false positives, and as you say, breaking when the rest of your stuff breaks.
You can put your test runner on different infrastructure, and now you have a whole new class of false positives to deal with. And it costs you a bit more because you're probably paying someone for the different infra.
You can put several test runners on different infrastructure in different parts of the world. This increases your costs further. The only truly clear signals you get from this are when all are passing or all are failing. Any mixture of passes and fails has an opportunity for misinterpretation. Why is Sydney timing out while all the others are passing? Is that an issue with the test runner or its local infra, or is there an internet event happening (cable cut, BGP hijack, etc) beyond the local infra?
And thus nearly everyone has a human in the loop to interpret the test results and make a decision about whether to post, regardless of how far they've gone with automation.
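For what it's worth, the automatable part usually reduces to a quorum rule over the regional probes, with anything mixed kicked to a human. A toy sketch (probe URLs are hypothetical):

    import concurrent.futures
    import urllib.request

    # hypothetical regional probe endpoints checking the same target
    PROBES = {
        "us-east":  "https://probe-us.example.com/check",
        "eu-west":  "https://probe-eu.example.com/check",
        "ap-south": "https://probe-ap.example.com/check",
    }

    def probe_ok(url: str) -> bool:
        """One regional probe: did the target answer with HTTP 200 in time?"""
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False

    def classify() -> str:
        with concurrent.futures.ThreadPoolExecutor() as pool:
            results = dict(zip(PROBES, pool.map(probe_ok, PROBES.values())))
        failures = [region for region, ok in results.items() if not ok]
        if not failures:
            return "up"
        if len(failures) == len(results):
            return "down"                  # all probes agree: page someone
        return f"ambiguous: {failures}"    # mixed signal: a human decides

The interesting branch is the last one, which is exactly the Sydney-times-out-while-everyone-else-passes case above.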
I know this is bad, and some people's livelihood and lives rely on critical infrastructure, but when these things happen, I sometimes think GOOD!, let's all just take a breather for a minute yeh? Go outside.
One of the things that I didn't like about Cloudflare's MITM-as-a-service is their requirement that, if you want SSL/CDN, you must use their DNS.
Overconcentration of infra within one single point of disruption, with no easy outs when the stack tips over.
Sadly I don't see any changes or rethinking towards being more decentralised, even after this outage.
Yeah, they keep reinforcing bad vendor lock-in practices. I'd guess the number of free users surpasses the paying ones, and situations like these leave them all unable to recover.
Interesting (unnerving?) to see that a number of domain registrars offering their own DNS services use at least some kind of Cloudflare service for their own web fronts. I checked 6 registrar sites I currently interact with: half were down (Namecheap/Spaceship, Name, Dynadot) and half were up (Porkbun, Gandi, GoDaddy).
I just considered moving from Namecheap to Porkbun as Namecheap is down, but Porkbun use Cloudflare for their CAPTCHA meaning I'm unable to signup and I assume log in as well, so also no good.
Only true if your audience doesn't require edge distribution, if your origin can handle the increased load and security exposure, and if you don't use any advanced features (routing, edge compute...).
If your site is only hosted on one server and it catches fire, you can swiftly reinstall on a new server and change the IP your domain is pointing to, too... Still a single point of failure.
Yes, everything in the world is a single point of failure and has always been, if we look at things that way. But if it can be remedied quickly, then it's not a huge concern.
If I had pointed my name servers somewhere else, then that of course would be the new single point of failure. You can't escape it, no matter how much hacker snark and down votes you have.
Just checked INWX from here in Germany. I was able to log in and get to my DNS records. Just in case you're looking for an alternative after all this.
Even if he blocked it by accident, that is not a reason to shout.
Shouting will not prevent errors; you are only creating a hostile work environment where not acting is better than the risk of making a mistake and triggering an aggressive response on your part.
That's why I run my server on 7100 chips made for me by Sam Zeloof in his garage on a software stack hand coded by me, on copper I ran personally to everyone's house.
You are joking but working on making decentralization more viable would indeed be more healthy than throwing hands up and accepting Cloudflare as the only option.
There was an article on HN a few days back about how companies like this are influencing the overall freedom of the web and pushing their own way of doing things (I've lost the source). I see similar influence from others like Vercel, for example on the enterprise side. And just a few days back, we saw the same with AWS.
What would the Internet's architecture have to look like for DDOS'ing to be a thing of the past, and therefore Cloudflare to not be needed?
I know there are solutions like IPFS out there for doing distributed/decentralised static content distribution, but that seems like only part of the problem. There are obviously more types of operation that occur via the network -- e.g. transactions with single remote pieces of equipment etc, which by their nature cannot be decentralised.
Anyone know of research out there into changing the way that packet-routing/switching works so that 'DDOS' just isn't a thing? Of course I appreciate there are a lot of things to get right in that!
What would that look like? A network with built-in rate & connection limiting?
The closest thing I can think of is the Gemini protocol browser. It uses TOFU for authentication, which requires a human to initially validate every interaction.
It's impossible to stop DDoS attacks because of the first "D".
If a botnet gets access through 500k IP addresses belonging to home users around the world, there's no way you could have prepared yourself ahead of time.
The only real solution is to drastically increase regulation around security updates for consumer hardware.
Maybe that's the case, but it seems like this conclusion is based on the current architecture of the internet. Maybe there are ways of changing it that mean these issues are not a thing!
It's not an architectural problem. It's a fundamental issue with trust and distributed systems. The same issues occur in physical spaces, like highways.
The core issue is that hackers can steal the "identity" of internet customers at scale, not that the internet allows unauthenticated traffic.
Why does a fridge need the right to initiate connections to something on the internet?
Why does a fridge even need to be reachable from the internet?? You should have some AI agent for managing your "smart" home. At least that's how sci-fi movies/games show it, e.g. Iron Man or StarCraft II ;)
I was thinking of a reaction to a DDOS event, so those devices are flagged as being infected. You could prevent future attacks if those devices are ignored until they get fixed.
That is what ISPs do these days. Most botnet members don't end up spamming a lot of requests, usually just a few before they are blocked.
The issue with DDOS is specifically with the distributed nature of it. One single bot of a botnet is pretty harmless, it's the cohesive whole that's the problem.
To make botnets less efficient you need to find members before they do anything. Retroactively blocking them won't really help, you'll just end up cutting off internet for regular people, most of whom probably don't even know how to get their fridge off of their local network.
There's not really any easy fix for this. You could regulate it, and require a license to operate IoT devices with some registration requirement + fines if you don't keep them up to date. But even that will probably not solve the issue.
Works for static content and databases, but I don't think it works for applications where there is by necessity only one destination that can't be replicated (e.g. a door lock).
> Investigating - Cloudflare is aware of, and investigating an issue which potentially impacts multiple customers. Further detail will be provided as more information becomes available.
Things are back up (a second time) for me.
Cloudflare have now updated their status page to reflect the problems. It doesn't sound like they are confident the problem is fully fixed yet.
I got several emails from some uptime monitors I set up, due to failing checks on my website, and funnily enough I cannot log into any of them.
BetterStack, InStatus and HetrixTools seemingly all use Cloudflare on their dashboards, which means I can't login but I keep getting "your website/API is down" emails.
Update: I also can't login to UptimeRobot and Pulsetic. Now, I am getting seriously concerned about the sheer degree of centralization we have for CDNs/login turnstiles on Cloudflare.
In the beginning I thought my IP had fallen on the wrong side of Cloudflare and that I was being blocked from ~80% of the internet. I was starting to panic.
Update - The team is continuing to focus on restoring service post-fix. We are mitigating several issues that remain post-deployment.
Nov 18, 2025 - 15:40 UTC
Trying to figure out if this observation was intended to frame it so that it's less|same|more scary. The effect is more, but it sounds like the intention was less.
I'd love to read an article describing the HN setup. Seems that they got a lot of things right - self registration, influx of people during outages and plenty others. Admins, if you see this, please write about your craft!
We handle ~2M requests per second and CF eliminates about ⅔ of those. We need CF or something like it. Multi edge is harder than it sounds at very large scale.
There are still alternatives like Bunny https://status.bunny.net/history (may not be for everyone, but I like to post the CF alternatives so it becomes ever so slightly less of a default)
Sadly, I can report that this has brought down 2 of the major Mastodon nodes in the United Kingdom.
Happily, the small ones that I also use are still going without anyone apparently even noticing. At least, the subject has yet to reach their local timelines at the time that I write this.
2 of the other major U.K. nodes are still up, too.
> However its number of monthly active users have only grown since 2020.
Like everywhere it is mostly bots.
Look at the HN front page: there used to be 1-2 Twitter posts per day. Now it's barely one per week, and even those are usually just from two accounts (Karpathy and Carmack).
This is crazy. The internet has so much direct and transitive dependency on Cloudflare today. Pretty much the #1 dev slacking excuse today is no longer "code is compiling" but "Cloudflare is down".
It's hard not to use Cloudflare at least for me: good products, "free" for small projects, and if Cloudflare is down no one will blame you since the internet is down.
Well, no. If they are unreliable to the point of being an outlier when compared to the alternatives then people will switch. At this stage they’re not an outlier.
Maybe not, but they are approaching it. I wouldn't use it for anything funded with my own cash, I no longer recommend it as a first choice, but I'm not suggesting it gets replaced yet. It's somewhat in the 'legacy tech' category now in terms of how I perceive it and deal with it.
> if Cloudflare is down no one will blame you since the internet is down.
But this is not really the case. When Azure/AWS were down, same as now with Cloudflare: a significant amount of the web was down, but most of it was not. It just makes it more obvious which provider you use.
There’s certainly a business case for “which nines” after the talk of n nines. You ideally want to be available when your competitor, for instance, is not.
Setting up a replica and then pointing your API requests at it when a Cloudflare request fails is trivial. This way, if you have a SPA, then as long as your site/app is open the users won't notice.
The issue is DNS, since DNS propagation takes time. Does anyone have any ideas here?
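The request-level part really is small; a rough sketch of the idea, written here as a generic Python client (the fallback host is hypothetical, and, as the reply below notes, it only helps if the replica actually covers what the proxy was doing for you):

    import requests

    PRIMARY = "https://api.example.com"           # proxied through the CDN (hypothetical)
    FALLBACK = "https://api-direct.example.com"   # replica on a different provider (hypothetical)

    def get(path: str, timeout: float = 3.0) -> requests.Response:
        """Try the proxied endpoint first; fall back to the replica on 5xx or network errors."""
        try:
            resp = requests.get(PRIMARY + path, timeout=timeout)
            if resp.status_code < 500:
                return resp          # 2xx-4xx means the proxy answered meaningfully
        except requests.RequestException:
            pass                     # timeout / connection error: try the replica
        return requests.get(FALLBACK + path, timeout=timeout)

For the DNS half there's no instant fix beyond keeping TTLs low ahead of time, so that a record change actually propagates in minutes rather than hours.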
> Setting up a replica and then pointing your API requests at it when a Cloudflare request fails is trivial.
Only if you're doing very basic proxy stuff. If you stack multiple features and maybe even start using workers, there may be no 1:1 alternatives to switch to. And definitely not trivially.
You have the power to not host your own infrastructure on aws and behind cloudflare, or in the case of an employer you have the power to fight against the voices arguing for the unsustainable status quo.
If you need DDoS mitigation then you essentially need to rely on a third party. Every third party will have inevitable downtime. For many it’s just whether you’d prefer to be down while everyone else is down or not.
The HN crowd in particular absolutely has a say in this, given the amount of engineering leads, managers, and even just regular programmers/admins/etc that frequent here - all of whom contribute to making these decisions.
Think about this rationally. If Cloudflare doesn't fix it within reasonable time, you can just point to different name servers and have your problem fixed in minutes.
So why be on Cloudflare to start with? Well, if you have a more reliable way then there's no reason. If you have a less reliable way, then you're on average better off with Cloudflare.
Well, I can't change my NS since it's on Cloudflare too. But besides that, my personal opinion was not about this outage in particular, but more about the default approach of some websites that don't need all this tech (yes, I really was out of groceries).
> Is Cloudflare your domain registrar? In that case, yes I think you should think about being less dependent on them.
And why should I rethink my architecture now? If I had to manage redundant systems and keep track of circular dependencies, I could just keep managing my infra the old way, no?
I'm being sarcastic here, obviously, but really, one of the selling points for cloud back in the day was "you don't have to care about those details". You just need to care about other details now.
I am personally really happy with Cloudflare for domains, pages and DNS. I don't run critical stuff, but some websites do, and they should not be lazy about it.
We? I am not using it. I never used it and I will not use it. People should learn how to work with a firewall, set up a simple ModSecurity WAF and stop using this bullshit. Almost everything goes through Cloudflare, and Cloudflare also does TLS fronting for websites, so basically Cloudflare is a MITM spying proxy, but no one seems to care. :/
It's the web scrapers. I run a tiny little mom-and-pop website, and the bots were consistently using up all of my server's resources. Cloudflare more or less instantly resolved it.
You mean you outsourced to Cloudflare the decision on who is allowed to view your website. That could be well-intentioned, but it's a risky thing to do, and I would not want to outsource that decision. Especially as I wouldn't know who failed to get to my website, since there is no way to appeal the decision.
As a side note, what does your site do that it's possible to use up all server resources? Computers are stupid fast these days. I find it's really difficult to build something that doesn't scale to at least multiple hundreds of requests per second.
I’ve been DDoS’d countless times running a small scale, uncontroversial SaaS. Without them I would’ve had countless downtime periods with really no other way to mitigate.
There's plenty of DDoS if you're dealing with people petty enough.
The VPS I use will nuke your instance if you run a game server. Not due to resource usage, but because it attracts DDoS like nothing else. Ban a teen for being an asshole and expect your service to be down for a week. And there isn't really Cloudflare for independent game servers. There's Steam Networking but it requires the developer to support it and of course Steam.
> And there isn't really Cloudflare for independent game servers
And yet game servers still work fine. Which answers this subthread's question ("how likely is it to get DDoSed if you don't have Cloudflare"), answer: not very likely, it happens once in a while at most.
Have you tried Anubis or similar tools? I've had similar issues with bot scraping of a forum taking all server resources, and using a PoW challenge solved the problem.
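For anyone who hasn't seen the approach: the server hands out a nonce and a difficulty, the client's browser burns CPU until it finds a matching hash, and the server only does one cheap check, which is enough to make bulk scraping expensive. A minimal sketch of the verification side (parameters are illustrative, not Anubis's actual scheme):

    import hashlib
    import os

    DIFFICULTY_BITS = 16  # illustrative: ~65k hash attempts per challenge on average

    def issue_challenge() -> str:
        """Random nonce sent to the client along with the difficulty."""
        return os.urandom(16).hex()

    def verify(challenge: str, solution: str) -> bool:
        """Cheap server-side check: the hash must start with DIFFICULTY_BITS zero bits."""
        digest = hashlib.sha256((challenge + solution).encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY_BITS) == 0

    # the solving loop normally runs as JS in the browser; shown here for completeness
    challenge = issue_challenge()
    solution = next(str(i) for i in range(10**7) if verify(challenge, str(i)))
    print("solved with", solution)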
I wrote the below to explain to our users what was happening, so apologies if the language is too simple for a HN reader.
- 0630, we switched our DNS to proxy through CF, starting the collection of data, and implemented basic bot protections
- Unfortunately, whatever anti-bot magic they have isn't quite having the desired effect, even after two hours.
- 0830, I sign in and take a look at the analytics. It seems like <SITE NAME> is very popular in Vietnam, Brazil, and Indonesia.
- 0845, I make it so users from those countries have to pass a CF "challenge". This is similar to a CAPTCHA, but CF try to make it so there's no "choosing all the cars in an image" if they can help it.
- So far 0% of our Asian audience have passed a challenge.
I was arrested by Interpol in 2018 because of warrants issued by the NCA, DOJ, FBI, J-CAT, and several other agencies, all due to my involvement in running a DDoS-for-hire website. Honestly, anyone can bypass Cloudflare, and anyone that wants to take your website down will take it down. Luckily for all of us, most of the DDoS-for-hire websites are down nowadays, but there are still many botnets out there that will get past basically any protection, and you can get access to them for as little as $5.
Like? Aside from scanning DNS records (assuming the protected IP is in there somewhere) or scanning the entire IPv4 space (assuming the server responds to non-Cloudflare requests), I can't think of any. And both methods are simple to protect against.
One minute, what? Can you elaborate on that. I have loads of questions. What exactly were you doing? What consequences did you face? How come you are talking about it?
No, but because all of us were arrested in 2018 for running DDoS-for-hire services. Bypassing Cloudflare is very easy, and I could still fry any of your websites (if I wanted to, just like any other skid).
There are plenty of alternatives for protecting against DDoSing; people just like convenience. "Nobody gets fired for choosing Microsoft/Cloudflare." We have a culture problem.
Because of the 2018 operation "Power OFF", but it's still pretty easy to take anything down.
Hetzner has the WEAKEST DDoS protection out of ANYTHING out there - Arbor sucks.
Send me your website URL and I'll keep it down for DAYS, and whenever you cry to Hetzner I'll just fry it again. It's that easy, and that's why they're the cheapest - because everyone ran away from them back then.
I run a few websites with moderate traffic (~900K daily page views total) on the same VPS and never had an issue with DDOS. Is this specific to some industries?
My small SaaS app has been DDoSed a handful of times, always accompanied by an email asking for a ransom in the form of bitcoin.
The first time we switched to Cloudflare which saved us. Even with Cloudflare, the DDoS attempts are still damaging (the site goes down, we use Cloudflare to block the endpoints they're targeting, they change endpoints, etc.) but manageable. Without Cloudflare or something like it, I think it's possible that we'd be out of business.
Honestly, it kind of is. AI bots scrape everything now, social media means you can go viral suddenly, or you make a post that angers someone and they launch an attack just because. I default to Cloudflare because, like an umbrella, I might just be carrying it around most of the time, but in the case of a sudden downpour it's better than getting wet.
It's not super common, but common enough that I don't want to deal with it.
The other part is just how convenient it is with CF. Easy to configure, plenty of power and cheap compared to the other big ones. If they made their dashboard and permission-system better (no easy way to tell what a token can do last I checked), I'd be even more of a fan.
If Germany's Telekom were forced to peer on DE-CIX, I'd always use CF. Since it isn't and CF doesn't pay for peering, it's a hard choice for Germany but an easy one everywhere else.
Cloudflare seems to have degraded performance. Half the requests for my site throw Cloudflare 5xx errors, the other half work fine.
However the https://www.cloudflarestatus.com/ does not really mention anything relevant. What's the point of having a status page if it lies ?
Update: Ah, I just checked the status and now I get a big red warning (however, the problem existed for like 15 minutes before 11:48 UTC):
> Investigating - Cloudflare is aware of, and investigating an issue which potentially impacts multiple customers. Further detail will be provided as more information becomes available. Nov 18, 2025 - 11:48 UTC
> What's the point of having a status page if it lies ?
Status pages are basically marketing crap right now. The same thing happened with Azure where it took at least 45 minutes to show any change. They can't be trusted.
Please read my comment again including the update:
For 15 minutes Cloudflare wasn't working and the status page did not mention anything. Yes, right now the status page mentions the serious network problem, but for some time our pages were not working and we didn't know what was happening.
So for ~15 minutes the status page lied. The whole point of a status page is to not lie, i.e. to be updated automatically when there are problems, and not by a person who needs to get clearance on what and how to write.
I can now imagine a scenario where everyone has become so dependent on the AI tool that it going down could turn into an unanticipated black start event for the entire internet.
I sense a great disturbance in the force... As if millions of cringefluencers suddenly cried out in terror cause they had to come up with an original thought.
It's insane to me that big internet uptime monitoring tools like Pingdom and Downdetector both seem to rely on Cloudflare, as both of those are currently unavailable as well.
There are no truly automated status pages. It's an impossible problem. I mean that seriously. At scale you're collecting hundreds of thousands (or millions) of metrics/spans/logs across tens or hundreds of loosely coupled systems. Building a system that can accurately analyze these and assess what the status page should say, in real time, without human intervention, is just not possible with current technology.
Even just the basic question of "are we down or is our monitoring system just having issues" requires a human. And it's never "are we down", because these are distributed systems we're talking about.
If service X goes down entirely, does that warrant a status page update? Yes? Turns out system X is just running ML jobs in the background and has no customer impact.
If service Z's p95 response latency jumps from 10ms to 1500ms for 5 minutes, 500s spike at the same time, but overall 200s rate is around 98%, are we down? is that a status page update? Is that 1 bad actor trying to cause issues? Is that indicative of 2,000 customers experiencing an outage and the other 98,000 operating normally? Is that a bad rack switch that's causing a few random 500s across the whole customer base and the service will reject that node and auto-recover in a moment?
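And even when you try to encode that judgment, the rule ends up looking like the sketch below, where every constant is a guess that will be wrong for somebody (all thresholds hypothetical):

    def should_post_incident(p95_ms: float, error_rate: float, success_rate: float) -> bool:
        """Naive heuristic: every threshold here is a judgment call in disguise."""
        latency_bad = p95_ms > 1000        # is 1s "down"? for some customers, absolutely
        errors_bad = error_rate > 0.01     # 1% 500s: one bad actor, or 2,000 customers?
        mostly_fine = success_rate > 0.97  # 98% 200s can still hide a full regional outage
        return (latency_bad or errors_bad) and not mostly_fine

    # the example above: p95 at 1500ms, 500s spiking, but ~98% of requests still succeed
    print(should_post_incident(p95_ms=1500, error_rate=0.02, success_rate=0.98))  # -> False

The heuristic says "don't post" even though a slice of customers may be completely down, which is why the human stays in the loop.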
I can answer that - once the lawyers take interest in your SLAs, you need to check with them if this is really an incident. Otherwise, you might lose some contract money and nobody wants that.
The app for Velib, the main bike rental in Paris, isn't working, but the bikes can still be taken with NFC. However, my station, which is always full at this time, is now empty, with only 2 bad bikes. It may be related. Yet, push notifications are working.
I'm going to take the metro now, wondering how long we have until the entire transit network goes down because of a similar incident.
I think you should give me a credit for all the income I lost due to this outage. Who authorized a change to the core infrastructure during the period of the year when your customers make the most income? Seriously, this is a management failure at the highest levels of decision-making. We don't make any changes to our server infrastructure/stack during the busiest time of the year, and neither should you. If there were an alternative to Cloudflare, I'd leave your service and move my systems elsewhere.
Later today or tomorrow there's going to be a post on HN pointing to Cloudflare's RCA and multitudes here are going to praise CF for their transparency. Let's not forget that CF sucks and took half the internet down for four hours. Transparency or no, this should not be happening.
A lot of things shouldn't be happening. Fact is that no one forced half the internet to make CF their point of failure. The internet should ask itself if that was the right call.
Speaking of 5 9s, how would you achieve 5 9s for a basic CRUD app that doesn't need to scale but still needs to be globally accessible? No auth, microservices, email or 3rd-party services. Just a classic backend connected to a db (any db tech, hosted wherever) that serves up some HTML.
You probably cannot achieve this with a single node, so you'll at least need to replicate it a few times to combat the normal 2-3 9s you get from a single node. But then you've got load balancers and dns, which can also serve as single point of failure, as seen with cloudflare.
Depending on the database type and choice, it varies. If you've got a single node of postgres, you can likely never achieve more than 2-3 9s (aws guarantees 3 9s for a multi-az RDS). But if you do multi-master cockroach etc, you can maybe achieve 5 9s just on the database layer, or using spanner. But you'll basically need to have 5 9s which means quite a bit of redundancy in all the layers going to and from your app and data. The database and DNS being the most difficult.
Reliable DNS provider with 5 9s of uptime guarantees -> multi-master load balancer each with 3 9s, -> each load balancer serving 3 or more apps each with 3 9s of availability, going to a database(s) with 5 9s.
This page from Google shows their uptime guarantees for Bigtable: 3 9s for a single-region cluster, 4 9s for multi-cluster, and 5 9s for multi-region.
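The arithmetic behind those numbers: components in series multiply their availabilities, while N redundant replicas only fail together. A quick sketch using the kinds of figures mentioned above (and assuming independent failures, which days like today show is optimistic):

    def series(*availabilities: float) -> float:
        """Chain of dependencies: all must be up."""
        result = 1.0
        for a in availabilities:
            result *= a
        return result

    def parallel(a: float, n: int) -> float:
        """n independent replicas: down only if every replica is down."""
        return 1 - (1 - a) ** n

    stack = series(
        0.99999,               # DNS provider with 5 nines
        parallel(0.999, 2),    # two load balancers, 3 nines each
        parallel(0.999, 3),    # three app replicas, 3 nines each
        0.99999,               # multi-region database with 5 nines
    )
    print(f"{stack:.6f}")      # ~0.999979: the serial DNS and DB terms dominate

So even with generous redundancy in the middle, the two serial five-nine dependencies cap you at roughly 11 minutes of expected downtime a year, just short of five nines.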
Part of the up-time solution is keeping as much of your app and infrastructure within your control, rather than being at the behest of mega-providers as we've witnessed in the past month: Cloudflare, and AWS.
Probably:
- a couple of tower servers, running Linux or FreeBSD, backed up by a UPS and an auto-start generator with 24 hours' worth of diesel (depending on where you are, and the local area's propensity for natural disasters - maybe 72 hours),
- Caddy for a reverse proxy, Apache for the web server, PostgreSQL for the database;
- behind a router with sensible security settings, that also can load-balance between the two servers (for availability rather than scaling);
- on static WAN IPs,
- with dual redundant (different ISPs/network provider) WAN connections,
- a regular and strictly followed patch and hardware maintenance cycle,
- located in an area resistant to wildfire, civil unrest, and riverine or coastal flooding.
I'd say that'd get you close to five 9s (no more than ~5 minutes downtime per year), though I'd pretty much guarantee five 9s (maybe even six 9s - no more than 32 seconds downtime per year) if the two machines were physically separated from each other by a few hundred kilometres, each with their own supporting infrastructure above, sans the load balancing (see below), through two separate network routes.
Load balancing would become human-driven in this 'physically separate' example (cheaper, less complex): if your-site-1.com fails, simply re-point your browser to your-site-2.com which routes to the other redundant server on a different network.
The hard part now will be picking network providers that don't share the same pipes/cables or upstreams, i.e. that don't both rely on Cloudflare, or AWS...
Keep the WAN IPs written down in case DNS fails.
PostgreSQL can do master-master replication (via third-party extensions), but I understand it's a pain to set up.
What if you could create a super virtual server of sorts? Imagine a new cloud provider like Vercel, but called something else. When you create a server on their service, they actually create three behind the scenes: one on AWS, one on GCP and one on Azure. They are three separate servers, but to the end user they look like a single server, and the end user gets to control how many cloud providers are involved. When AWS goes down, no worries, it switches over to the one on GCP.
I didn’t see anyone comment this directly, but something these recent outages made me wonder, having spent a good chunk of my career in 24/7 tech support, is that I can’t even fathom the amount of people who have been:
- restarting their routers and computers instead of taking their morning shower, getting their morning coffee, taking their medication on time because they’re freaking out, etc.
- calling ISPs in a furious mood not knowing it’s a service in the stack and not the provider’s fault (maybe)
- being late for work in general
- getting into arguments with friends and family and coworkers about politics and economics
- being interrupted making their jerk chicken
> A fix has been implemented and we believe the incident is now resolved. We are continuing to monitor for errors to ensure all services are back to normal. Posted 3 minutes ago. Nov 18, 2025 - 14:42 UTC
Seems like they think they've fixed it fully this time!
Close! They just updated their status and it's back to working on a fix:
Update - Some customers may be still experiencing issues logging into or using the Cloudflare dashboard. We are working on a fix to resolve this, and continuing to monitor for any further issues.
Nov 18, 2025 - 14:57 UTC
Phew, my latest 3h30 workshop about Obsidian was saved.
I recorded it this morning, not knowing about the Cloudflare issue (probably started while I was busy). I'm using Circle.so and they're down (my community site is now inaccessible). Luckily, they probably use AWS S3 or similar to host their files, so that part is still up and running.
Meanwhile all my sites are down. I'll just wait this one out, it's not the end of the world for me.
My GitHub Actions are also down for one of my projects because some third-party deps go through Cloudflare (Vulkan SDK). Just yesterday I was thinking to myself: "I don't like this dependency on that URL..." Now I like it even less.
I've been considering Cloudflare for caching, DDoS protection and WAF, but I don't like furthering the centralization of the Web. And my host (Vultr) has had fantastic uptime over the 10 years I've been on them.
How are others doing this? How is Hacker News hosted/protected?
I got an email saying that my OpenAI auto-renewal failed and my credits have run out. I go to OpenAI to reauthorize the card, and I can't log in because OpenAI uses Cloudflare for "verifying you are a human", which goes into an infinite loop. Great.
For anyone reading this who desperately needs their website up, you can try this: If you manage to get to your Cloudflare DNS settings and disable the "Proxy status (Proxied)" feature (the orange cloud), it should start working again.
Be aware that this change has a few immediate implications:
- SSL/TLS: You will likely lose your Cloudflare-provided SSL certificate. Your site will only work if your origin server has its own valid certificate.
- Security & Performance: You will lose the performance benefits (caching, minification, global edge network) and security protections (DDoS mitigation, WAF) that Cloudflare provides.
This will also reveal your backend internal IP addresses. Anyone can find permanent logs of public IP addresses used by even obscure domain names, so potential adversaries don't necessarily have to be paying attention at the exact right time to find it.
Unfortunately, this will also expose your IP address, which may leave you vulnerable even when the WAF and DDoS protections come back up (unless you take the time to only listen for Cloudflare IP address ranges, which could still take a beefy server if you're having to filter large amounts of traffic).
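The "only listen for Cloudflare ranges" part is at least scriptable. A rough sketch, assuming the edge ranges are still published at www.cloudflare.com/ips-v4 and ips-v6; it only prints candidate firewall rules rather than applying anything:

    import urllib.request

    # Cloudflare publishes its edge ranges as plain text, one CIDR per line
    SOURCES = [
        "https://www.cloudflare.com/ips-v4",
        "https://www.cloudflare.com/ips-v6",
    ]

    def fetch_ranges() -> list[str]:
        ranges = []
        for url in SOURCES:
            with urllib.request.urlopen(url, timeout=10) as resp:
                ranges += [line.strip() for line in resp.read().decode().splitlines() if line.strip()]
        return ranges

    def as_nft_rules(ranges: list[str], port: int = 443) -> list[str]:
        # illustrative nftables-style output only; adapt to your firewall of choice
        return [f"add rule inet filter input ip saddr {cidr} tcp dport {port} accept"
                for cidr in ranges if ":" not in cidr]  # IPv4-only in this sketch

    for rule in as_nft_rules(fetch_ranges()):
        print(rule)

The caveat in the parent stands: filtering at the origin still costs you CPU and bandwidth, so this narrows the exposure rather than removing it.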
Recently, several of my VPN server nodes (on VPSes from different providers) randomly could not connect to Cloudflare CDN IPs, while the host Linux network had no such issue; VPP shares the same address as Linux and uses tc stateless NAT to do the trick.
I finally worked around it by changing the TCP options sent by the VPP TCP stack.
But the whole thing made me worry that something had been deployed on their side which caused the issue.
I don't think that's related to this network issue; it just reminded me of the above. There seem to be frequent new articles about Cloudflare networking, and maybe new methods or new deployments correlate with a higher probability of issues.
Looking forward to seeing their RCA. I'm guessing it's going to be glossy in terms of actual customer impact. "We didn't go offline, we just had 100% errors. For 60 minutes."
I would love to see a competition for the most banal thing that went wrong as a result of this. For example, I’m pretty sure the reason my IKEA locker wouldn’t latch shut was because the OS had hung while talking to a Cloudflare backend.
This reminds me that I really like self-hosting. While it is true that many things do not work, all my services do work. It has some trade-offs, of course.
DigitalOcean + Gandi means nothing I run is down. Amazing. We depend far too heavily on centralised services, where we deem the value of reputation and convenience to exceed the potential downsides, and then the world pays for it. I think we have to feel a lot more of this pain before regulation kicks in to change things, because the reality is people don't change. The only thing you can personally do is run your own stuff for the things you can.
The sites I host on Cloudflare are all down. Also, even ChatGPT was down for a while, showing the error: "Please unblock challenges.cloudflare.com to proceed."
What do we actually lose going from cloud back to ground?
The mass centralization is a massive attack vector for organized attempts to disrupt business in the west.
But we're not doing anything about it because we've made a molehill out of a mountain. Was it that hard to manage everything locally?
I get that there are plenty of security implications going that route, but it would be much harder to bring down large portions of online business with a single attack.
> What do we actually lose going from cloud back to ground?
A lot of money related to stuff you currently don't have to worry about.
I remember how shit worked before AWS. People don't remember how costly and time-consuming this stuff used to be. We had close to 50 people in our local ops team back in the day when I was working with Nokia 13 years ago. They had to deal with data center outages, expensive storage solutions failing, network links between data centers, offices, firewalls, self-hosted Jira running out of memory, and a lot of other crap that I don't spend a lot of time worrying about with a cloud-based setup. That's just a short list of stuff that was repeatedly an issue. Nice when it worked. But nowhere near five nines of uptime.
That ops team alone cost probably a few million per year in salaries alone. I knew some people in that team. Good solid people but it always seemed like a thankless and stressful job to me. Basically constant firefighting while getting people barking at you to just get stuff working. Later a lot of that stuff moved into AWS and things became a lot easier and the need for that team largely went away. The first few teams doing that caused a bit of controversy internally until management realized that those teams were saving money. Then that quickly turned around. And it wasn't like AWS was cheap. I worked in one of those teams. That entire ops team was replaced by 2-3 clued in devops people that were able to move a lot faster. Subsequent layoff rounds in Nokia hit internal IT and ops teams hard early on in the years leading up to the demise of the phone business.
Yeah, people have such short memories for this stuff. When we ran our own servers a couple of jobs ago, we had a rota of people who'd be on call for events like failing disks. I don't want to ever do that again.
In general, I'm much happier with the current status of "it all works" or "it's ALL broken and it's someone else's job to fix it as fast as possible"!
Not saying it's perfect, but neither was on-prem/colocation.
Yesterday I decided to finally write my makefiles to "mirror" (make available offline) the docs of the libraries I'm using. doc2dash for sphinx-enabled projects, and then using dash / zeal.
Then I was like... "when did I last fly for 10+ hours and want to do programming, such that I'd need offline docs?" So I gave up.
Today I can't browse the libs' docs quickly, so I'm resuming the work on my local mirroring :-)
There is an election in Denmark today, I wonder if this will affect that. The governments website is not accessible at the moment because it uses Cloudflare.
One way to mitigate DDoS is to enforce source IP checks on the way OUT of a datacenter (egress).
Sure, there are botnets, infected devices, etc. that would conform to this, but where does the sheer power of a big DDoS attack come from, including from those who sell it as a service? They have to have some infrastructure in some datacenter, right?
Make a law that forces every edge router of a datacenter to check the source IP and you would eliminate a very big portion of DDoS as we know it.
Until then, the only real and effective method of mitigating a DDoS attack is with even more bandwidth: you become a black hole for the attack, which is basically what Cloudflare is.
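What the parent is describing already has a name: source address validation / BCP 38, enforced at the edge router rather than in software. Conceptually the check is just "does this source address belong to a prefix we actually delegated to this port", as in this toy sketch (the prefix table is made up):

    import ipaddress

    # hypothetical mapping: which prefixes were delegated to which customer port
    DELEGATED = {
        "port-42": [ipaddress.ip_network("203.0.113.0/24")],
        "port-43": [ipaddress.ip_network("198.51.100.0/25")],
    }

    def egress_allowed(port: str, src_ip: str) -> bool:
        """Drop spoofed packets: the source must sit inside a prefix delegated to this port."""
        addr = ipaddress.ip_address(src_ip)
        return any(addr in net for net in DELEGATED.get(port, []))

    print(egress_allowed("port-42", "203.0.113.7"))  # True: legitimate source
    print(egress_allowed("port-42", "192.0.2.99"))   # False: spoofed, would be dropped

It mostly kills spoofed-source reflection attacks, though; a botnet of real devices sending from their real addresses sails straight through, which is why botnets remain the bigger problem.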
Alright, what you are proposing is kind of hard to do.
Source routing is not easy, and source validation is even harder.
And what prevents me, as an abusive hoster or "bad guy", from just announcing my own IP space directly on a transit or IXP?
You might say the IXP should do source checking as well, but what if the IP space is distributed/anycasted across multiple ASNs on the IXP?
Also, if you add multiple egress points distributed across different routing domains, it gets complicated fast.
Does my transit upstream need to do source validation of my IP space? What about their upstream? Also, how would they know which IP space belongs to which ASNs, considering the allocation of ASN numbers and IP space is distributed across different organisations across the globe (some of which are more malicious/non-functional than others [0])? Source routing becomes extremely complex because there is no single, universal mapping between IP space and the ASNs it belongs to.
The biggest attacks literally come from botnets. There’s not a lot coming from infrastructure services precisely because these services are incentivized to shut that shit down. At most it would be used as the control plane which is how people attempt to shut down the botnets.
Our national transit agency is apparently a customer.
The departure tables are borked, showing incorrect data, the route map stopped updating, the website and route planner are down, and the API returns garbage. Despite everything, the management will be pleased to know the ads kept on running offline.
Why you would put a WAF between devices you control and your own infra, God knows.
Why do people use the reverse proxy functionality of Cloudflare? I've worked at small to medium sized businesses that never had any of this while running public facing websites and they were/are just fine.
Same goes for my personal projects: I've never been worried about being targeted by a botnet so much that I introduce a single point of failure like this.
Any project that starts gaining any bit of traction gets hammered with bots (the ones that try every single /wp URL even though you don't use WordPress), frequent DDoS attacks, and so on.
I consider my server's real IP (or load balancer IP) as a secret for that reason, and Cloudflare helps exactly with that.
Everything goes through Cloudflare, where we have rate limiters, Web firewall, challenges for China / Russian inbound requests (we are very local and have zero customers outside our country), and so on.
People think that running Node.js servers is a good idea, and those fall over if there's ever so much as a stiff breeze, so they put Cloudflare in front and call it a day.
It gives really good caching functionality so you can have large amounts of traffic and your site can easily handle it. Plus they don't charge for egress traffic.
What exactly are you serving that bot traffic affects your quality of service?
I've seen an RPi serve a few dozen QPS of dynamic content without issue... The only service I've had actually get successfully taken down by benign bots is a Gitea-style git forge (which was 'fixed' by deploying Anubis in front of it).
>Cloudflare is aware of, and investigating an issue which potentially impacts multiple customers. Further detail will be provided as more information becomes available.
I had two completely unrelated tabs open (https://twitter.com and https://onsensensei.com), both showing the same error. Opened another website, same error. Kinda funny to see how much of the entire web runs on Cloudflare nowadays.
The non-profit I volunteer at is unreachable. It gives a Cloudflare error page, which is sort of helpful: it tells me that the site is OK but Cloudflare has a 500.
It's been great, but I always get wary when a company starts doing more than its initial calling. There have been a ton of large attacks and tons of bot scrapers, so it's the Wild West.
Yes, they're spreading themselves very thin with lots of new releases/products - but they will lose a lot of customers if their reliability comes into question.
So they broke the internet. Nice!
Never seen so many sites not working.
Never seen so many desktop apps suddenly stop working.
I don't want to be the person responsible for this.
And this has again taught me it's better not to rely on external services, even when they seem too big to fail.
Down, but the linked status page shows mostly operational, except for "Support Portal Availability Issues" and planned maintenance. Since it was linked, I'm curious if others see differently.
edit: It now says "Cloudflare Global Network experiencing issues" but it took a while.
Cloudflare runs a high-demand service, and the centralisation does deserve scrutiny. I think a good middle ground I'll adopt is self-hosting critical services and then, when they have an outage, redirecting traffic to a Cloudflare outage banner.
Cloudflare Dashboard/clicky-clicky UI is down. I really appreciate that their API is still working. A small change in our Terraform configuration and now I can go to lunch in peace, knowing our clients at skeeled can keep working if they want to.
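For anyone in the same boat without Terraform handy: the same kind of change can be made directly against the API. A hedged sketch only (IDs and token are placeholders, double-check the endpoint against the current v4 API docs, and this is not the actual change mentioned above):

    import os
    import requests

    API = "https://api.cloudflare.com/client/v4"
    TOKEN = os.environ["CF_API_TOKEN"]        # scoped token with DNS edit permission
    ZONE_ID = os.environ["CF_ZONE_ID"]        # placeholder
    RECORD_ID = os.environ["CF_RECORD_ID"]    # placeholder

    # flip one DNS record from proxied (orange cloud) to DNS-only (grey cloud)
    resp = requests.patch(
        f"{API}/zones/{ZONE_ID}/dns_records/{RECORD_ID}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"proxied": False},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json()["result"]["name"], "is now DNS-only")

The caveats discussed elsewhere in the thread apply: you lose the proxy's TLS, caching and DDoS shielding, and you expose the origin IP.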
Meanwhile my Wordpress blog on DigitalOcean is up. And so is DigitalOcean.
My ISP is routing public internet traffic to my IPs these days. What keeps me from running my blog from home? Fear of exposing a TCP port, that's what. What do we do about that?
Depending on the contract, you might not be allowed to run public network services from your home network.
I had a friend doing that and once his site got popular the ISP called (or sent a letter? don't remember anymore) with "take this 10x more expensive corporate contract or we will block all this traffic".
In general, the reason ISPs don't want you to do that (in addition to the much more expensive corporate rates) is the risk of someone DDoSing that site, which could cause issues for large parts of their domestic customer base (and, depending on the country, make them liable to compensate those customers for not providing a service they paid for).
> Our Engineering team is actively investigating an issue impacting multiple DigitalOcean services caused by an upstream provider incident. This disruption affects a subset of Gen AI tools, the App Platform, Load Balancer, Spaces and provisioning or management actions for new clusters. Existing clusters are not affected. Users may experience degraded performance or intermittent failures within these services.
> We acknowledge the inconvenience this may cause and are working diligently to restore normal operations. Signs of recovery are starting to appear, with most requests beginning to succeed. We will continue to monitor the situation closely and provide timely updates as more information becomes available. Thank you for your patience as we work towards full service restoration.
Yeah, DigitalOcean and Dreamhost are both up. I actually self-host on 2Gig fibre service, and all my stuff is up, except I park everything behind Cloudflare since there is no way I could handle a DDoS attack.
No logging in to Cloudflare Dash, no passing Turnstile (their CAPTCHA Replacement Solution) on third-party websites not proxied by Cloudflare, the rest that are proxied throwing 500 Internal server error saying it's Cloudflare's fault…
Investigating - Cloudflare is aware of, and investigating an issue which potentially impacts multiple customers. Further detail will be provided as more information becomes available.
Nov 18, 2025 - 11:48 UTC
Yeah, those "multiple customers" are like 70% of the internet.
I'm thinking about all those quips from a few decades back, along the lines of: "The Internet is resilient, it's distributed and it routes around damage" etc.
In many ways it's still true, but it doesn't feel like a given anymore.
We finally switched to CF a few weeks ago (for bot protection, abusive traffic started getting insane this year), finally we can join in on one of the global outage parties (no cloud usage otherwise, so still more uptime than most).
Hey, this is fun, all my websites are still up! I wonder how that happened? I don't even have to worry about my docker registry being down because I set up my own after the last global outage.
Because javascript programmers are cheaper/easier/whatever to hire? So everything becomes web-centric. (I'm hoping for this comment to be sarcastic but I wouldn't be surprised if it turns out not to be)
This Internet thing is steadily becoming the most fragile attack surface out there. No need for nuclear weapons anymore; just hit Cloudflare and AWS and we are back to the stone age.
Why are we seeing AWS, then Azure, then Cloudflare all going down just out of the blue? I know they go down occasionally, but it's typically not major outages like this...
We're on the enterprise plan, so far we're seeing Dashboard degradation and Turnstile (their captcha service) down. But all proxying/CDN and other services seem to work well.
If someone wanted to learn about how the modern infrastructure stack works, and why things like this occur, where would be some good resources to start?
I sometimes question my business decision to have a multi-cloud, multi-region web presence where it is totally acceptable to be down with the big boys.
Prior hosting provider was a little-known company with decent enough track record, but because they employed humans, stuff would break. When it did break, C-suite would panic about how much revenue is lost, etc.
The number of outages was "reasonable" to anyone who understood the technical side, but non-technical people would complain for weeks after an outage about how we're always down, "well BigServiceX doesn't break ever, why do we?", and again about lost revenue.
Now on Azure/Cloudflare, we go down when everyone else does, but C-Suite goes "oh it's not just us, and it's out of our control? Okay let us know when it fixes itself."
A great lesson in optics and perception, for our junior team members.
Ah! Well, all of my websites are down! I’m going to take screenshots and have it as part of my Time Capsule Album, “Once upon a Time, my websites used to go down.”
Strange thing is, this is happening in multiple Cloudflare regions; everything using bot protection & WAF is down. Just got a colleague to check our site and both the London & Singapore Cloudflare servers are out... And I can't even log into the Cloudflare dash to re-route critical traffic.
Likely this is accidental, but one day there will be something malicious that will have a big impact, given how centralised the internet now is.
I assume the locations are operating fine, since you can see the error pages. The culprit here is probably the Network, which at the time of writing, shows up as offline
Makes you realise, if Cloudflare or one of these large organisations decides to (/ gets ordered by a deranged US president to) block your internet access, that's a whole lot of internet you're suddenly cut off from. Yes, I know there are circumventions, but it's still a worrying thought.
In theory even a single company service could be distributed, so only a fraction of websites would be affected, thus it's not a necessity to be a single point of failure. So I still don't like this argument "you see what happens when over half of the internet relies on Cloudflare". And yes, I'm writing this as a Cloudflare user whose blog is now down because of this. Cloudflare is still convenient and accessible for many people, no wonder why it's so popular.
But, yeah, it's still a horrible outage, much worse than the Amazon one.
The "omg centralized infra" cries after every such event kind of miss the point. Hosting with smaller companies (shared, VPS, dedi, colo, whatever) will likely result in far worse downtime, individually.
Ofc the bigger perception issue here is many services going out at the same time, but why would (most) providers care if their annual downtime does or doesn't coincide with others? Their overall reliability is no better or worse had only their service gone down.
All of this can change, of course, if this becomes a regular thing; the absolute hours of downtime do matter.
It would appear if you use a VPN in Europe you can still access Cloudflare sites, I have just tried, for me the Netherlands, Germany, and France work, but the UK and USA don't.
EDIT: It would appear it is still unreliable in these countries, it just stopped working in France for me.
If a cloud vendor with 1 million users experiences a long term outage: the vendor has a serious problem. If a cloud vendor with 1 billion users experiences a long term outage: the internet has a serious problem. Yada-yada-yada xkcd/2347 but it's the big block in the middle which crumbled
For fun, I asked google what's an alternative to Cloudflare. It says, "A complete list of Cloudflare alternatives depends on which specific service (CDN, security, Zero Trust, edge computing, etc.) you are replacing, as no single competitor offers the exact same all-in-one suite"
Didn't have my site on Cloudflare because it would be faster for Chinese users (its main demographic), so I THOUGHT I was fine for a second, until I remembered the data storage API is behind Cloudflare.
Used a down-detector site to check if Cloudflare is down, but the site is running on Cloudflare, so I couldn't check if Cloudflare was down for anyone else, because Cloudflare was down.
Cloud in general was a mistake. We took a system explicitly designed for decentralization and resilience and centralized it and created a few neat points of failure to take the whole damn thing down.
Cloudflare provides some nice services that have nothing to do with cloud or not. You can self-host private tunnels, application firewalls, traffic filtering, etc, or you can focus on building your application and managing your servers.
I am a self-hosting enthusiast. So I use Hetzner, Kamal and other tools for self-managing our servers, but we still have Cloudflare in front of them because we didn't want to handle the parts I mentioned (yet; we might sometime).
Calling it a mistake is a very narrow look at it. Just because it goes down every now and then, it isn't a mistake. Going for cloud or not has its trade-offs, and I agree that paying 200 dollars a month for a 1GB Heroku Redis instance is complete madness when you can get a 4GB VPS on Hetzner for €3.80 a month. Then again, some people are willing to make that trade-off to avoid managing the servers.
Cloud servers have taught me so much about working with servers because they are so easy and cheap to spin up, experiment with and then get rid of again. If I had had to buy racks and host them each time I wanted to try something, I would've never done it.
Sure, it's a great fair-weather technology, makes some things cheap and easy.
But in the face of adversity, it's a huge liability. Imagine Chinese Hackers taking down AWS, Cloudflare, Azure and GCP simultaneously in some future conflict. Imagine what that would do to the West.
I don't believe in Fukuyama's End of History. History is still happening, and the choices we make will determine how it plays out.
Thanks, I was too lazy to write this, and noticed this comment multiple times now. It's good to be sceptical at times, but in this case it simply misses the mark.
Threat actors (DDoS) and AI scraping have already thrown a wrench into decentralization. It's become quite difficult to host anything even marginally popular without robust infrastructure that can eat a lot of traffic.
Down... "Please unblock challenges.cloudflare.com to proceed." On every Cloudflare hosted website that I try. This timing SUCKS.......... please resolve fast! <3
Oh no, we can’t take a (former) executive to task about what they’ve wrought with their influence!!! That would be wrong.
If anything, he should be the first to be blamed for the greater and greater effect this tech monster has on internet stability, since, you know, his people built it.
When will Cloudflare actually split into several totally independent companies to remedy the fact that they bring down the Internet every time they have a major issue?
Just yesterday Cloudflare announced it was acquiring Replicate (an AI platform). "The Workers Platform mission: Our goal all along has been to enable developers to build full-stack applications without having to burden themselves with infrastructure," according to Cloudflare's blog. Are we cooked?
I am using Cloudflare as the backend for my site (Workers) but have disabled all their other offerings. I was affected for a short while but seem to have been less affected than other people.
The biggest lesson for me from this incident: NEVER make your DNS provider and CDN provider the same vendor. Now I can't even log in to the dashboard to switch the DNS. Sigh.
Linode has been rock solid for me. I wanted to back this comment with uptime numbers, unfortunately the service I use for that, Uptime Robot, is down because of Cloudflare...
Update
We've deployed a change which has restored dashboard services. We are still working to remediate broad application services impact
Posted 2 minutes ago. Nov 18, 2025 - 14:34 UTC
but...
I'm stuck at the captcha that does not work:
dash.cloudflare.com
Verifying you are human. This may take a few seconds.
dash.cloudflare.com needs to review the security of your connection before proceeding.
Haha they updated their status page: "Identified - A global upstream provider is currently experiencing an outage which is impacting platform-level and project-level services"
While my colleagues are wondering why Cloudflare isn't working and are afraid it might be something on our end, I'll first check here to make sure it's not a Cloudflare / AWS problem in the first place.
It's the old IBM thing. If your website goes down along with everyone else's because of Cloudflare, you shrug and say "nothing we could do, we were following the industry standard". If your website goes down because of on-prem then it's very much your problem and maybe you get to look forward to an exciting debrief with your manager's manager.
That's lazy engineering and I don't think we as technical, rational people should make that our way of working. I know the saying, but I disagree with it. My fuckups, my problem, but at least I can avoid fuckups actively if I am in charge.
I don't, since my stuff is reachable only within the company network/VPN. If I needed to, though, I would consult the BSI list of official DDoS mitigation services [0] and evaluate each one before deciding. I would not auto-pick Cloudflare.
Yeah, but people aren't using Cloudflare just for DDoS mitigation. Some are running pretty much everything over it, from DNS to edge caching to load balancing and even hosting. That's what I mainly oppose.
Unless you are really big, on-prem stuff would be 90% internal anyway. For everything public you'd host your hardware in a datacenter with better high-speed connectivity. And pretty much every single datacenter I've interacted with in the last 5 years offers a DDoS protection solution that you can order for your network.
That's fair, yeah, and I agree it's not always feasible. But if you have any influence over technical direction at your org, I encourage what I wrote above. Otherwise, yeah, let the bean counters at the C-level dig their own grave.
Funnily and ironically enough, I was trying to check out a few things on Ansible Galaxy and... I ended up here trying to submit the link for the ongoing CF incident.
I would only consider doing stuff on-prem because of services like Cloudflare. You can have some of the global features like edge-caching while also getting the (cost) benefits of on-prem.
Well, between AWS us-east-1 killing half the internet and this incident, not even a month has passed. Meanwhile, my physical servers don't care and happily serve many people at a lower cost than any cloud offering.
You realize these are two different companies, right? If you're saying "I'm an AWS customer with Cloudflare in front," I think you've failed to realize that two 99.9%-available services in series have a combined availability of ~99.8%. That's just math.
Your physical servers should have similar issues if you put a CDN in front, unless the physical server is able to achieve 100% uptime (100% * three nines = three nines). Or you don't have a CDN, but then you can be trivially knocked offline by the tiniest botnet (or even by hitting the Hacker News front page).
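A minimal sketch of that arithmetic in Python, with illustrative availability figures and assuming independent failures (real outages are often correlated, which only makes things worse):

    # Series availability: every extra dependency multiplies the nines away.
    origin_availability = 0.999   # e.g. an AWS-hosted origin (illustrative figure)
    cdn_availability = 0.999      # e.g. a CDN in front of it (illustrative figure)

    combined = origin_availability * cdn_availability
    print(f"combined: {combined:.4%}")   # ~99.80%, roughly 17.5 hours of downtime per year

    # Even a perfect origin is capped by the CDN sitting in front of it:
    print(f"perfect origin + CDN: {1.0 * cdn_availability:.4%}")   # 99.90% -- three nines stay three nines

Stacking providers in series can only multiply nines away; it never adds any back unless the providers are in a genuine failover arrangement rather than a dependency chain.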
I do. But I put both into the "cloud offering off-prem for very much money" shoebox. I set up a CDN once using VPSes from different hosting providers for under 100 USD a month, which I would vastly prefer over trusting anything cloud.
And yes, I know there are sites that need the scale of an operation like Cloudflare or AWS. But 99.9(...)% of sites don't, and people should start realizing that.
People who don't need that also don't care much about an hour or two of service disruption. Most users will have far worse disruptions with the alternatives.
We have a few colocated servers offsite, each in a different region, each with a zpool of mirrored spinning rust. We use rsync across those at different times.
This seems to corroborate the recent controversial claims that American workers do not possess the aptitudes needed to succeed in the 21st century. If only we could have gotten more children to learn to code. Sigh.
I happened to be working with Claude when this occurred. Having no idea what exactly the cause was, I jumped over to GPT and observed the same. I did a dig challenges.cloudflare.com, and by the time I'd figured out roughly what was happening, it seemed to have... resolved itself.
I must say I'm astonished, as naive as it may be, to see the number of separate platforms affected by this. And it has been a bit of a learning experience too.
Is anybody keeping statistics on the frequency of these big global internet outages? It seems to be happening extremely frequently as of late, but it would be nice to have some data on that.
My theory is that people's skills are getting worse. Attention spans are diminishing, memory is shrinking. People age and retire, and new, less skilled generations are replacing them. There are studies about declining IQ in the last decades. Probably mobile phones and social media are to blame.
We see the signs with Amazon and Cloudflare going down and Windows Update breaking stuff. But the worst is yet to come, and I am thinking about air traffic control, nuclear power plants, surgeons...
> There are studies about declining IQ in the last decades. Probably mobile phones and social media are to blame.
It is much more nuanced than that.
The long-term rise (Flynn Effect) of IQs in the 20th century is widely believed to be driven by environmental factors more than genetics.
Plateau / decline is context-dependent: the reversal or slowdown isn't as universal as you suggest. It seems more pronounced in certain countries or cohorts.
Cognitive abilities are diversifying: As people specialize more (education, careers, lifestyles), the structure of intelligence (how different cognitive skills relate) might be changing.
I am paying for this shit service and this is the longest downtime I've had in years. Can anyone recommend another bottleneck to be annoyed with in the future?
They are decentralized, with servers all on the East Coast that they self-host. They do have points of failure that can take down the whole network, however.
Half of the internet is down. That's what you get for giving up control of a system that's supposed to be decentralized to one company. Good, maybe if it costs companies a few billion they will not put all their eggs in one basket.
I'm wary of the broader internet having SPOFs like AWS and Cloudflare. You can't change routing or DNS horizons to get around it. Things are just broken in ways that are not only opaque but destructive, due to so much relying on fragile sync state.
Will my Spelling Bee QBABM count today, or will it fail and tomorrow I find out that last MA(4) didn't register, ruining my streak? Society cannot function like this! /s
AWS, Azure, now Cloudflare, all within a month, are hit with configuration errors that are definitely neither signs of more surveillance gear being added by government agencies nor attacks by hostile powers. It's a shame that these fine services that everyone apparently needs and that worked so well for so long without a problem suddenly all have problems at the same time.
AWS was not a configuration error; it was a race condition in their automated DNS record management that produced empty DNS records. As that issue was being fixed, it cascaded into further, more complex issues that overloaded EC2 instance provisioning.
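As a purely hypothetical illustration (not AWS's actual system), here is the shape of that kind of check-then-act race in Python: an applier writes a fresh record while a cleanup job, acting on a stale view of which plan is current, deletes it and leaves the name with no records at all.

    import threading, time

    records = {"service.example.com": ["10.0.0.1"]}   # illustrative DNS table

    def apply_new_plan():
        # Applier: installs the record for the newest plan.
        records["service.example.com"] = ["10.0.0.2"]

    def cleanup_stale_plan():
        # Cleaner: based on a stale snapshot, it believes this record belongs to a
        # superseded plan and deletes it -- after the applier has already written.
        time.sleep(0.1)
        records.pop("service.example.com", None)

    t1 = threading.Thread(target=apply_new_plan)
    t2 = threading.Thread(target=cleanup_stale_plan)
    t1.start(); t2.start(); t1.join(); t2.join()
    print(records)   # {} -- an empty record set, so lookups for the endpoint fail

Real systems guard against this with versioned or conditional writes rather than timing luck, but the failure shape is the same: the loser of the race is a deletion, so the name simply vanishes.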
Gemini is up, I asked it to explain what's going on in cave man speak:
YOU: Ask cave-chief for fire.
CAVE-CHIEF (Cloudflare): Big strong rock wall around many other cave fires (other websites). Good, fast wall!
MANY CAVE-PEOPLE: Shout at rock wall to get fire.
ROCK WALL: Suddenly… CRACK! Wall forgets which cave has which fire! Too many shouts!
RESULT:
Your Shout: Rock wall does not hear you, or sends you to wrong cave.
Other Caves (like X, big games): Fire is there, but wall is broken. Cannot get to fire.
ME (Gemini): My cave has my own wall! Not rock wall chief! So my fire is still burning! Good!
BIG PROBLEM: Big strong wall broke. Nobody gets fire fast. Wall chief must fix strong rock fast!