Having their changes fully propagate within 1 minute is pretty fantastic.
This is most likely a hard requirement for DDoS protection and detection at such a large scale, which explains their architectural choices (ClickHouse & co) and the need for very low-latency config changes.
Since attackers might rotate IPs more often than once per minute, this effectively means the whole fleet of servers has to react quickly to decisions made centrally.
Why wasn’t the change rolled back within the second minute after they saw the 500s?
The coolest part of Cloudflare’s architecture is that every server is the same… which presumably makes deployment a straightforward task.
The bad change wasn't even a deployment as such, just an entry in the global KV store https://blog.cloudflare.com/introducing-quicksilver-configur...
Actual deployments take hours to propagate worldwide.
(Disclosure: former Cloudflare SRE)