This is an architectural problem. The Lua bug, the longer global outage last week, and a long list of earlier such outages only uncover the problem with the architecture underneath. The original distributed, decentralized web architecture, with heterogeneous endpoints managed by a myriad of organisations, is much more resistant to this kind of global outage. Homogeneous systems like Cloudflare will continue to cause global outages. Rust won't help, people will always make mistakes, also in Rust. A robust architecture addresses this by not allowing a single mistake to bring down a myriad of unrelated services at once.
I’m not sure I share this sentiment.
First, let’s set aside the separate question of whether monopolies are bad. They are not good but that’s not the issue here.
As to architecture:
Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.
But there’s a more interesting argument in favour of the status quo.
Assuming Cloudflare’s uptime is above average, outages that affect everything at once are actually better for the average internet user.
It might not be intuitive but think about it.
How many Internet services does someone depend on to accomplish something such as their work over a given hour? Maybe 10 directly, and another 100 indirectly? (Make up your own answer, but it’s probably quite a few).
If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.
On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.
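A back-of-the-envelope sketch of that arithmetic (a toy model, assuming the figures above: 100 services that each lose one hour a year, failing either all together or independently):

    // Toy model, not real outage data: n services, each down h hours/year.
    fn main() {
        let n = 100.0_f64;            // services a person depends on
        let h = 1.0_f64;              // hours of downtime per service per year
        let year = 365.0_f64 * 24.0;  // hours in a year

        // Correlated case: every service fails in the same window,
        // so the person loses exactly that one window.
        let correlated = h;

        // Independent case: the chance that all n services are up at any
        // given moment is (1 - h/year)^n, so the expected time with at
        // least one service down is:
        let independent = (1.0 - (1.0 - h / year).powf(n)) * year;

        println!("all at once:   ~{correlated:.1} h/year blocked");
        println!("independently: ~{independent:.1} h/year blocked"); // ~99.4
    }

With 100 services the independent case comes out just under 100 hours a year, which is where the rough figure above comes from.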
It’s not really a bad end-user experience that every service uses Cloudflare. It’s more a question of why Cloudflare’s stability seems to be going downhill.
And that’s a fair question. Because if their reliability is below average, then the value prop evaporates.
> If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.
The point is that it doesn’t matter. A single site going down has a very small chance of impacting a large number of users. Cloudflare going down breaks an appreciable portion of the internet.
If Jim’s Big Blog only maintains 95% uptime, most people won’t care. If BofA were at 95%... actually, same. Most of the world aren’t BofA customers.
If Cloudflare is at 99.95%, then the whole world suffers its roughly 4.4 hours of downtime a year at the same time.
Maybe the world can just live without the internet for a few hours.
There are likely emergency services dependent on Cloudflare at this point, so I’m only semi serious.
That's an interesting point, but in many (most?) cases productivity doesn't depend on all services being available at the same time. If one service goes down, you can usually be productive by using an alternative (e.g. if HN is down you go to Reddit, if email isn't working you catch up with Slack).
If HN, Reddit, email, Slack and everything else is down for a day, I think my productivity would actually go up, not down.
"My architecture depends upon a single point of failure" is a great way to get laughed out of a design meeting. Outsourcing that single point of failure doesn't cure my design of that flaw, especially when that architecture's intended use-case is to provide redundancy and fault-tolerance.
The problem with pursuing efficiency as the primary value prop is that you will necessarily end up with a brittle result.
> If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.
I’m tired of this sentiment. Imagine if people had said: why develop your own cloud offering? Can you really do better than VMware…?
Innovation in technology has only happened because people dared to do better, rather than giving up before they started…
> If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.
I disagree; most people need only a subset of Cloudflare's features. Operating just that subset avoids the risk of the other moving parts (that you don't need anyway) ruining your day.
Cloudflare is also a business with its own priorities, like releasing new features; this is detrimental to you because you won't benefit from a feature you don't need, yet you still incur the risk of its deployment going wrong, like we saw today. Operating your own stack would minimize such changes and let you schedule them in a maintenance window to limit the impact should something go wrong.
The only feature Cloudflare (or its competitors) offers that can't be done cost-effectively yourself is volumetric DDoS protection where an attacker just fills your pipe with junk traffic - there's no way out of this beyond just having a bigger pipe, which isn't reasonable for any business short of an ISP or infrastructure provider.
>The only feature Cloudflare (or its competitors) offers that can't be done cost-effectively yourself is volumetric DDoS protection
... And thanks to AI, everyone needs that all the time now, since putting a site on the Internet means an eternal DDoS attack.
> Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.
Why is that the only option? Cloudflare could offer solutions that let people run their software themselves, after paying some license fee. Or there could be many companies people use, instead of everyone flocking to one because of cargo-culting "You need a CDN like Cloudflare before you launch your startup, bro".
What you’re suggesting is not trivial; otherwise we wouldn’t use various CDNs. To do what Cloudflare does, your starting point is “be multi-region/multi-cloud from launch”, which is non-trivial, especially while you’re finding product-market fit. A better poor man’s CDN is object storage through your cloud of choice serving HTTP traffic. Cloudflare also offers layers of security and other creature comforts. Ignoring the extras, if you build what they offer you have effectively made a startup within a startup.
Cloudflare isn’t the only game in town either. Akamai, Google, AWS, etc all have good solutions. I’ve used all of these at jobs I’ve worked at and the only poor choice has been to not use one at all.
What do you think Cloudflare’s core business is? Because I think it’s two things:
1. DDoS protection
2. Plug n’ Play DNS and TLS (termination)
Neither of those make sense for self-hosted.
Edit: If it’s unclear, #2 doesn’t make sense because if you self-host, it’s no longer plug n’ play. The existing alternatives already serve that case equally well (even better!).
Cloudflare Zero-Trust is also very core to their enterprise business.
All of my company's hosted web sites have way better uptimes and availability than CF but we are utterly tiny in comparison.
With only some mild blushing, you could describe us as "artisanal" compared to the industrial monstrosities, such as Cloudflare.
Time and time again we get these sorts of issues with the massive cloudy chonks and they are largely due to the sort of tribalism that used to be enshrined in the phrase: "no one ever got fired for buying IBM".
We see the dash to the cloud and, as a result, the shoddy state of in-house corporate IT. "We don't need in-house knowledge, we have the 'MS Copilot 365 office thing' that looks after itself, and now it's intelligent - yay \o/"
Until I can't, I'm keeping it as artisanal as I can for me and my customers.
In other words, the consolidation on Cloudflare and AWS makes the web less stable. I agree.
Usually I am allergic to pithy, vaguely dogmatic summaries like this but you're right. We have traded "some sites are down some of the time" for "most sites are down some of the time". Sure the "some" is eliding an order of magnitude or two, but this framing remains directionally correct.
Does relying on larger players result in better overall uptime for smaller players? AWS is providing me better uptime than if I assembled something myself because I am less resourced and less talented than that massive team.
If so, is it a good or bad trade to have more overall uptime but when things go down it all goes down together?
From a societal view it is worse when everything is down at once; it leads to a less resilient society. It is not great if I can't buy essentials from one store because its payment system is down (this happened to a supermarket chain in Sweden due to a hacker attack some years ago; it took weeks to fully fix everything, and then there was the whole CrowdStrike debacle globally more recently).
It is far worse if all of the competitors are down at once. To some extent you can and should have a little bit of stock at home (water, food, medicine, ways to stay warm, etc) but not everything is practical to do so with (gasoline for example, which could have knock on effects on delivery of other goods).
When only one thing goes down, it's easier to compensate with something else, even for people who are doing critical work but who can't fix IT problems themselves. It means there are ways the non-technical workforce can figure out to keep working, even if the organization doesn't have on-site IT.
Also, if you need to switch over to backup systems for everything at once, then either the backup has to be the same for everything and very easy to implement remotely - which seems unlikely for specialty systems, like hospital systems, or for the old tech that so many organizations still rely on (remember the CrowdStrike BSODs that had to be fixed individually, in person, and so took forever to fix?) - or you're gonna need a LOT of well-trained IT people, paid to be on standby constantly, if you want to fix the problems quickly, because they can't be everywhere at once.
If the problems are more spread out over time, then you don't need to have quite so many IT people constantly on standby. Saves a lot of $$$, I'd think.
And if problems are smaller and more spread out over time, then an organization can learn how to deal with them regularly, as opposed to potentially beginning to feel and behave as though the problem will never actually happen. And if they DO fuck up their preparedness/response, the consequences are likely less severe.
> AWS is providing me better uptime than if I assembled something myself because I am less resourced and less talented than that massive team.
Is it? I can’t say that my personal server has had any unplanned downtime in the past 10 years, and these global outages have just flown right past it.
Has your ISP never gone down? Or did it go down some night and you just never realized?
AWS and Cloudflare can recover from outages faster because they can bring dozens (hundreds?) of people to help, often the ones who wrote the software and designed the architecture. Outages at smaller companies I've worked for have often lasted multiple days, up to an exchange server outage that lasted 2 weeks.
Would you rather be attacked by 1,000 wasps or 1 dog? A thousand paper cuts or one light stabbing? Global outages are bad but the choice isn’t global pain vs local pleasure. Local and global both bring pain, with different, complicated tradeoffs.
Cloudflare is down and hundreds of well paid engineers spring into action to resolve the issue. Your server goes down and you can’t get ahold of your Server Person because they’re at a cabin deep in the woods.
If you've allowed your Server Person to be a single point of failure out innawoods, that's an organizational problem, not a technological one.
Two is one and one is none.
It's not "1,000 wasps or 1 dog"; it's "1,000 dogs at once" or "1 dog at a time, 1,000 different times". A rare but huge coordinated siege, or a steady, predictable background radiation of small issues.
The latter is easier to handle, easier to fix, and much more survivable if you do fuck it up a bit. It gives you some leeway to learn from mistakes.
If you make a mistake during the 1000 dog siege, or if you don't have enough guards on standby and ready to go just in case of this rare event, you're just cooked.
Why would there be a centralized outage of decentralized services? The proper comparison seems to be attacked by a dog or a single wasp.
In most cases we actually get both local and global pain, since most people are running servers behind Cloudflare.
> Rust won't help, people will always make mistakes, also in Rust.
They don't just use Rust for "protection"; they use it first and foremost for performance. They get ballpark-to-matching C++ performance with a realistic ability to avoid a myriad of bugs that are the default elsewhere. This isn't new.
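To make that concrete, here's a minimal illustrative sketch (not Cloudflare's actual code) of what Rust rules out by default versus what it still allows:

    fn main() {
        let buf = vec![0u8; 4];

        // In C/C++, reading past the end of a buffer is undefined behavior
        // and can silently corrupt memory; in Rust, `buf[10]` is a
        // deterministic panic, and the idiomatic form makes the failure
        // a value you must handle:
        match buf.get(10) {
            Some(b) => println!("byte: {b}"),
            None => println!("out of bounds, handled explicitly"),
        }

        // Use-after-free doesn't compile at all; the borrow checker
        // rejects a reference into a Vec that has been dropped:
        // let dangling = { let v = vec![1, 2, 3]; &v[0] }; // error[E0597]

        // What Rust does NOT rule out: logic mistakes. Calling `.unwrap()`
        // on an unexpected input still takes the process down - hence
        // "people will always make mistakes, also in Rust".
        let parsed: Option<u32> = "not a number".parse().ok();
        println!("{:?}", parsed); // None; `.unwrap()` here would panic
    }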
You're playing armchair quarterback with nothing to really offer.
> Homogeneous systems like Cloudflare will continue to cause global outages
But the distributed system is vulnerable to DDoS.
Is there an architecture that maintains the advantages of both systems? (Distributed resilience with a high-volume failsafe.)
It's not as simple as that. What will result in more downtime, dependency on a single centralized service or not being behind Cloudflare? Clearly it's the latter or companies wouldn't be behind Cloudflare. Sure, the outages are more widespread now than they used to be, but for any given service the total downtime is typically much lower than before centralization towards major cloud providers and CDNs.
Obviously Rust is the answer to these kinds of problems. But if you are Cloudflare and run an important company at global scale, you need to set high standards for your Rust code. Developers should dance and celebrate at the end of the day if their code compiles in Rust.
Robust architecture that is serving 80M requests/second worldwide?
My answer would be that no one product should get this big.
On the other hand, as long as the entire internet goes down when Cloudflare goes down, I'll be able to host everything there without ever getting flak from anyone.
You're not wrong, but where's the robust architecture you're referring to? The reality of providing reliable services on the internet is far beyond the capabilities of most organizations.
You have a heterogeneous, fault-free architecture for the Cloudflare problem set? Interesting! Tell us more.
They badly need smaller blast radius and to use more chaos engineering tools.
You should really look at Cloudflare more closely.
There is not a single company that makes its infrastructure as globally available as Cloudflare does.
Additionally, Cloudflare's downtime seems to be objectively less than that of the others.
This time, the outage took 25 minutes to resolve and affected 28% of the network.
And that while being the only ones fixing a global vulnerability.
There is a reason other clouds can't touch the responsiveness and innovation that Cloudflare brings.
Bro, but how do we make shareholder value if we don't monopolize and enshittify everything?