by tetha 7 hours ago

This is kinda what I'm thinking. We're absolutely not at the scale Cloudflare is at.

But we run software and configuration changes through three tiers: the first stage is for the dev team only, the second stage serves internal customers and other teams that depend on it for integration and internal usage, and the third is production. Some teams have also split production into several rings, depending on the criticality and the number of customers affected.
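
Roughly the shape of it, as a sketch (stage names and soak times are made up for illustration, not our actual setup):

    # Hypothetical promotion order; names and minimum soak times are illustrative only.
    PIPELINE = [
        {"stage": "dev",        "audience": "dev team only",            "min_soak": "0m"},
        {"stage": "internal",   "audience": "internal customers/teams", "min_soak": "4h"},
        {"stage": "prod-ring1", "audience": "less critical customers",  "min_soak": "1d"},
        {"stage": "prod-ring2", "audience": "most critical customers",  "min_soak": "2d"},
    ]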

This led to a bunch of discussions early on, because teams with simpler software and very good testing usually push through dev and test with little or no trouble. And that's fine. If you have a track record of good changes, there is little reason to artificially prolong a deployment in dev and test just because. If you want to, go through them in minutes.

But after a few spicy production incidents, even the better and faster teams understood and accepted that once the technical ability to move fast exists, how fast you actually move is a choice, or a throttle if you want an analogy.

If you're doing well, by all means, promote from test to prod within minutes. If you fuck up production several times in a row and start threatening SLAs, slow down: spend more resources on manual testing and on improving automated testing, give changes time to simmer in the internal environment, and spend more time between promotions from one production ring to the next.
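
The throttle itself can be dead simple. A sketch with made-up numbers (the function name and the doubling rule are hypothetical, the point is just that the gate is a dial, not a constant):

    # Sketch: widen the required soak time between rings based on recent incidents.
    def required_soak_hours(base_hours: float, incidents_last_30d: int) -> float:
        if incidents_last_30d == 0:
            return 0.0  # clean track record: promote as fast as you like
        # each recent incident doubles the wait before the next ring
        return base_hours * (2 ** incidents_last_30d)

    # e.g. base of 4h: 1 incident -> 8h, 2 -> 16h, 3 -> 32h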

And this is on top of considerations like change risk. A frontend-only application can move much faster than the PostgreSQL team, because one rollback is a container restart and the other could be a multi-hour recovery from backups.
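
Change risk can feed the same dial. Another sketch, with invented numbers, where soak time scales with how painful a rollback would be:

    # Sketch: scale soak time by estimated rollback cost (values are illustrative).
    ROLLBACK_COST_HOURS = {
        "frontend": 0.1,  # container restart
        "postgres": 6.0,  # restore from backups
    }

    def risk_weighted_soak(service: str, base_hours: float = 4.0) -> float:
        return base_hours * max(1.0, ROLLBACK_COST_HOURS.get(service, 1.0))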