by parchley 7 hours ago

> At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.

Sorry, but that’s a method you use when you serve 100 requests per second, not when you’re at Cloudflare scale. Cloudflare easily has enough volume that this problem would show up as an instant, monitorable change in the failure rate.
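Back-of-the-envelope (all numbers here are made up, not Cloudflare's): at high request volume, even a small jump in error rate sits far outside the normal noise within a single second.

```python
import math

# Hypothetical numbers: neither the volume nor the rates are Cloudflare's.
requests_per_sec = 10_000_000
baseline_rate = 0.001          # 0.1% background error rate
observed_rate = 0.002          # rate during the incident

expected = requests_per_sec * baseline_rate   # 10,000 errors/sec
observed = requests_per_sec * observed_rate   # 20,000 errors/sec

# Poisson approximation: stddev of the error count is sqrt(expected).
z = (observed - expected) / math.sqrt(expected)
print(f"z-score after one second: {z:.0f}")   # ~100 sigma
```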

rossjudson 4 hours ago

Let's say you have 10 million servers. No matter what you are deploying, some of those servers are going to be in a bad state. Some are going to be in hardware ops/repair. Some are going to be mid-deployment for something else. A regional issue may be happening. You may have inconsistencies in network performance. Bad workloads somewhere, anywhere, can be causing a constant level of error traffic.
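Put assumed numbers on that: with a fleet that large, even a tiny per-machine unhealthy probability means thousands of machines are misbehaving at every instant, so "any error anywhere" is not a signal.

```python
import math

# Illustrative fleet math only; the fleet size and probability are assumptions.
servers = 10_000_000
p_unhealthy = 0.0005   # suppose 0.05% are in repair, mid-deploy, or degraded

expected_bad = servers * p_unhealthy                           # ~5,000, always
jitter = math.sqrt(servers * p_unhealthy * (1 - p_unhealthy))  # ~+/-70
print(f"~{expected_bad:.0f} bad servers at any moment, +/-{jitter:.0f}")
```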

At scale there's no such thing as "instant". There is a distribution of progress over time.

The failure is an event. Collection of events takes time (at scale, going through store-and-forward layers). Your "monitorable failure rate" is over an interval. You must measure for that interval. And then you are going to emit another event.
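A minimal sketch of that pipeline (all names and numbers are illustrative, not anyone's production system): events land in an interval bucket, and the rate for a bucket is only trustworthy after the interval has closed and the slowest forwarding path has delivered.

```python
from collections import defaultdict

INTERVAL = 60        # seconds per measurement window (assumed)
FORWARD_DELAY = 15   # store-and-forward lag before an event is collectable (assumed)

buckets = defaultdict(lambda: [0, 0])   # window_start -> [errors, total]

def record(event_time, is_error):
    """Attribute each event to the window in which it occurred."""
    window = event_time - (event_time % INTERVAL)
    buckets[window][0] += int(is_error)
    buckets[window][1] += 1

def closed_windows(now):
    """Yield (window, failure_rate) only for windows that have both ended
    and waited out the forwarding delay. Earlier reads would undercount."""
    for window, (errors, total) in sorted(buckets.items()):
        if now >= window + INTERVAL + FORWARD_DELAY and total:
            yield window, errors / total
```

Best case, a failure becomes visible INTERVAL + FORWARD_DELAY after it starts. Nothing about that is instant.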

Global config systems are a tradeoff. They're not inherently bad; they have both strengths and weaknesses. Really bad: non-zero possibility of system collapse. Bad: can progress very quickly toward global outages. Good: faults are detected quickly, response decision making is easy, and mitigation is fast.

Hyperscale is not just "a very large number of small simple systems".

Denoising alerts is a fact of life for SRE...and is a survival skill.
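Which is roughly what the quoted "three consecutive events" rule is: hysteresis. A toy version, with made-up thresholds, that only pages after N consecutive windows breach, so one noisy window from background failures doesn't wake anyone:

```python
def alerter(threshold=0.005, consecutive_required=3):
    """Return an observer that fires only after N consecutive breaches."""
    breaches = 0
    def observe(rate):
        nonlocal breaches
        breaches = breaches + 1 if rate > threshold else 0
        return breaches >= consecutive_required
    return observe

observe = alerter()
for rate in [0.006, 0.002, 0.007, 0.008, 0.009]:
    print(rate, "ALERT" if observe(rate) else "ok")
# Only the final window alerts: one blip resets the counter.
```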