> 2 minutes for their automated alerts to fire is terrible
I take exception to that, to be honest. It's not desirable or ideal, but calling it "terrible" is a bit ... well, sorry to use the word ... entitled. For context, I have experience running a betting exchange. A system where it's common for a notable fraction of transactions in a medium-volume event to take place within a window of less than 30 seconds.
Vast majority of current monitoring systems are built on Prometheus. (Well okay, these days it's more likely something Prom-compatible but more reliable.) That implies collection via recurring scrapes. A supposedly "high" frequency online service monitoring system does a scrape every 30 seconds. Well known reliability engineering practices state that you need a minimum of two consecutive telemetry points to detect any given event - because we're talking about a distributed system and network is not a reliable transport. That in turn means that with near-perfect reliability the maximum time window before you can detect something failing is the time it takes to perform three scrapes: thing A might have failed a second after the last scrape, so two consecutive failures will show up only after a delay of just-a-hair-shy-of-three scraping cycle windows.
At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.
As for my history? The betting exchange monitoring was tuned to run scrapes at 10-second intervals. That still meant that the first an alert fired for something failing could have been effectively 30 seconds after the failures manifested.
Two minutes for something that does not run primarily financial transactions is a pretty decent alerting window.
Prometheus compatible but more reliable? Sell it to me!
> At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.
Sorry but that’s a method you use if you serve 100 requests per second, not when you are at Cloudflare scale. Cloudflare easily have big enough volume that this problem would trigger an instant change in a monitorable failure rate.
Let's say you have 10 million servers. No matter what you are deploying, some of those servers are going to be in a bad state. Some are going to be in hw ops/repair. Some are going to be mid-deployment for something else. A regional issue may be happening. You may have inconsistencies in network performance. Bad workloads somewhere/anywhere can be causing a constant level of error traffic.
At scale there's no such thing as "instant". There is distribution of progress over time.
The failure is an event. Collection of events takes times (at scale, going through store and forward layers). Your "monitorable failure rate" is over an interval. You must measure for that interval. And then you are going to emit another event.
Global config systems are a tradeoff. They're not inherently bad; they have both strengths and weaknesses. Really bad: non-zero possibility for system collapse. Bad: Can progress very quickly to towards global outages. Good: Faults are detected quickly, response decision making is easy, and mitigation is fast.
Hyperscale is not just "a very large number of small simple systems".
Denoising alerts is a fact of life for SRE...and is a survival skill.
Critical high-level stats such as errors should be scraped more frequently than 30 seconds. It’s important to have multiple time granularity scraping intervals, a small set of most critical stats should be scraped closer to 10s or 15s.
Prometheus has as an unaddressed flaw [0], where rate functions must be at least 2x the scrape interval. This means that if you scrape at 30s intervals, your rate charts won’t reflect the change until a minute after.
"Scrape" intervals (and the plumbing through to analysis intervals) are chosen precisely because of the denoising function aggregation provides.
Most scaled analysis systems provide precise control over the type of aggregation used within the analyzed time slices. There are many possibilities, and different purposes for each.
High frequency events are often collected into distributions and the individual timestamps are thrown away.