by uyzstvqs 12 hours ago

What I'm missing here is a test environment. Gradual or not, why are they deploying straight to prod? At Cloudflare's scale, there should be a dedicated room in Cloudflare HQ with a full, isolated, model-scale deployment of their entire system. All changes should go there first, with tests run for every possible scenario.

Only after that do you use gradual deployment, with a big red oopsie button that immediately rolls the changes back. Languages with strong type systems won't save you; good procedure will.

tetha 7 hours ago

This is kinda what I'm thinking. We're absolutely not at the scale Cloudflare is at.

But we run software and configuration changes through three tiers: a first stage for the dev team only, a second stage with internal customers and other teams that depend on it for integration and internal use, and finally production. Some teams have also split production into different rings depending on the criticality and number of the customers.
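
To make the shape of that concrete, here's a minimal sketch of what such a tiered promotion could look like; the stage names, bake times and health check are made up for illustration.

    # Hypothetical sketch of tiered promotion; stage names, bake times and
    # the health check are invented for illustration.
    import time
    from dataclasses import dataclass

    @dataclass
    class Stage:
        name: str
        min_bake_seconds: int  # how long a change must sit here before moving on

    STAGES = [
        Stage("dev", 0),
        Stage("internal", 3600),          # internal customers / integration use
        Stage("prod-ring-1", 4 * 3600),   # less critical customers first
        Stage("prod-ring-2", 24 * 3600),  # most critical customers last
    ]

    def healthy(stage: Stage) -> bool:
        """Placeholder for real signals: error rates, alerts, SLO burn."""
        return True

    def promote(change_id: str, deployed_at: dict) -> None:
        """Walk the change forward one ring at a time if it has baked long enough."""
        for current, nxt in zip(STAGES, STAGES[1:]):
            started = deployed_at.get(current.name)
            if started is None:
                return  # not deployed to this stage yet
            if time.time() - started < current.min_bake_seconds or not healthy(current):
                print(f"{change_id}: holding in {current.name}")
                return
            deployed_at.setdefault(nxt.name, time.time())
            print(f"{change_id}: promoted {current.name} -> {nxt.name}")

    promote("change-42", {"dev": time.time()})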

This has led to a bunch of discussions early on, because teams with simpler software and very good testing usually push through dev and testing with little or no problem. And that's fine. If you have a track record of good changes, there is little reason to artificially prolong deployment in dev and test just because. If you want to, just go through it in minutes.

But after a few spicy production incidents, even the better and faster teams understood and accepted that once technical velocity exists, actual velocity is a choice, or a throttle if you want an analogy.

If you do good, by all means, promote from test to prod within minutes. If you fuck up production several times in a row and start threatening SLAs, slow down: spend more resources on manual testing and on improving automated testing, give changes time to simmer in the internal production environment, and leave more time between promotions from one production ring to the next.
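
A toy illustration of that throttle: the required soak time per stage could scale with the team's recent incident count (the doubling rule and the numbers here are invented, not what we actually run).

    # Toy illustration of "velocity is a throttle": soak time grows with the
    # team's recent track record. The doubling rule and numbers are made up.
    def required_soak_hours(base_hours: float, incidents_last_90_days: int) -> float:
        return base_hours * (2 ** incidents_last_90_days)

    print(required_soak_hours(1, 0))  # clean track record: promote after an hour
    print(required_soak_hours(1, 3))  # three recent incidents: wait eight hours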

And this is on top of considerations like change risk. A frontend-only application can move much faster than the PostgreSQL team, because one rollback is a container restart and the other could be a multi-hour recovery from backups.

bombcar 8 hours ago

They have millions of “free” subscribers; said subscribers should be the guinea pigs for rollouts; paying (read: big) subscribers can get the breaking changes later.

beardedetim 8 hours ago

This feels like such a valid solution and is how past $dayjobs released things: send to the free users, then roll out to paying users once that's proven not to blow up.
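
For what it's worth, a rough sketch of what gating a rollout by plan tier could look like; the plan names, percentages and bucketing scheme are hypothetical.

    # Hypothetical tier-based rollout gate: free zones get the change first,
    # paying tiers follow in later phases. Names and numbers are invented.
    import hashlib

    ROLLOUT_PERCENT = {"free": 100, "pro": 10, "enterprise": 0}

    def in_rollout(zone_id: str, plan: str, feature: str = "new-rule") -> bool:
        """Deterministically bucket a zone so repeated checks agree."""
        digest = hashlib.sha256(f"{feature}:{zone_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < ROLLOUT_PERCENT.get(plan, 0)

    # Free zones are enabled now; enterprise zones wait until the percentage is raised.
    print(in_rollout("zone-123", "free"), in_rollout("zone-123", "enterprise"))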

sznio 6 hours ago

If your target is availability, that's correct.

If your target is security, then, _assuming your patch is actually valid_, you're giving better security coverage to free customers than to your paying ones.

Cloudflare is both, and their tradeoffs seem to be set on maximizing security at the cost of availability. And it makes sense: a fully unavailable system is perfectly secure.

ectospheno 7 hours ago

Free tier doesn’t get WAF. We kept working.

bsdpqwz 7 hours ago

Their December 3rd blog about React states:

"These new protections are included in both the Cloudflare Free Managed Ruleset (available to all Free customers) ..... "

Having some burn-in time in the free tier before it hits the whole network would have been good?!

vouwfietsman 9 hours ago

> Languages with strong type systems won't save you

Neither will seatbelts if you drive into the ocean, or helmets if you drink poison. I'm not sure what your point is.

djmips 7 hours ago

I think you strengthened their point.