What We Learned During the Recent Cloudflare Outage: Improving Monitoring Resilience

The Cloudflare outage last month exposed a critical flaw in our monitoring system. Our alerts went haywire, repeatedly flipping between reporting sites as up, then down, then up again. This created a flood of contradictory notifications that rendered our entire alerting system practically useless at the very time our customers most needed reliable information.

The Anatomy of a CDN Outage

To understand why this happened, it's important to recognize how modern websites rely on Content Delivery Networks (CDNs) like Cloudflare. These services distribute website content across global server networks to improve loading speeds and reliability. When a CDN experiences issues, the effects aren't binary; they're inconsistent and unpredictable.

During the outage, some requests would succeed while others failed, depending on routing paths, retry mechanisms, and which specific Cloudflare nodes were handling the traffic. This created a situation where consecutive checks to the same endpoint might receive completely different responses, even seconds apart.

The Monitoring Challenge

Traditional uptime monitoring follows a relatively simple logic: if a check fails, the service is considered down; if it succeeds, the service is up. This approach works well in straightforward scenarios where issues are clear-cut, but it falls apart during complex, partial outages affecting internet infrastructure.
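To make the failure mode concrete, here is a minimal sketch of that simple logic. It is an illustration only, not Monitodo's actual code; the function name `naive_alerts` and the sample data are hypothetical.

```python
from typing import Iterable, List


def naive_alerts(check_results: Iterable[bool], service: str = "Service X") -> List[str]:
    """Emit a notification every time a single check flips the state."""
    alerts: List[str] = []
    is_up = True  # assumption: the service starts out healthy
    for ok in check_results:
        if ok != is_up:
            # One check result is enough to change state and notify.
            is_up = ok
            alerts.append(f"{service} is {'back up' if ok else 'down'}")
    return alerts


# Hypothetical results during a partial outage: checks pass and fail intermittently.
flaky_results = [True, False, True, False, False, True]
for alert in naive_alerts(flaky_results):
    print(alert)
```

With only six checks, this sketch already produces four notifications, which is the pattern that becomes a flood when conditions fluctuate for hours.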

During the Cloudflare incident, we discovered that our existing verification system wasn't robust enough. We already had verification checks from different regions in place, but they weren't designed to handle the unique challenges of a global CDN outage. Our monitors were technically doing their job, accurately reporting what they saw at each moment, but the rapidly fluctuating conditions meant users received a large number of contradictory alerts as services appeared to go up and down repeatedly. The system was working as designed, but the design itself needed improvement for these edge cases.

The Real Cost: Alert Flooding

For our users, this led to a frustrating sequence of incident notifications:

1. "Service X is down"

2. "Service X is back up" (15 minutes later)

3. "Service X is down again" (10 minutes later)

4. "Service X is back up" (20 minutes later)

This pattern continued for hours during the outage. While each alert was triggered by real data, the overall effect was that our notifications created noise that obscured rather than clarified the situation. Once we discovered the problem, we temporarily disabled notifications for affected monitors and got to work on a fix.

Our Solution: Intelligent Incident Management

After analyzing the data from this event, we've implemented several significant improvements to our incident management system:

1. Enhanced Verification

Rather than relying on a single backup check, we now employ a more sophisticated verification sequence when potential downtime is detected:

• Initial check fails from primary location

• First verification check from a different region

• If still failing, additional verification checks from multiple regions

• Temporal spacing between checks to account for transient issues

This approach dramatically reduces false positives while adding only seconds to incident detection time.
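The sequence above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our production implementation: the `check` callback, the region names, and the rule that every verification region must fail are all hypothetical (a real system might use a quorum instead).

```python
import time
from typing import Callable, Sequence


def confirm_downtime(
    check: Callable[[str], bool],
    primary: str,
    first_verifier: str,
    extra_verifiers: Sequence[str],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only if every stage of the sequence observes a failure.

    `check(region)` is a hypothetical callback returning True when the
    monitored endpoint responds successfully from that region.
    """
    if check(primary):
        return False  # primary check passed, nothing to verify
    sleep(delay_seconds)  # spacing so a transient blip can clear
    if check(first_verifier):
        return False  # a different region can reach the service
    for region in extra_verifiers:
        sleep(delay_seconds)
        if check(region):
            return False  # assumption: one success is enough to hold off
    return True


# Hypothetical scenario: only one region can still reach the service.
reachable_from = {"ap-south"}
is_down = confirm_downtime(
    check=lambda region: region in reachable_from,
    primary="us-east",
    first_verifier="eu-west",
    extra_verifiers=["ap-south", "sa-east"],
    sleep=lambda _seconds: None,  # skip real waiting in this demo
)
print("open incident" if is_down else "hold off, not confirmed")
```

Passing `sleep` as a parameter keeps the spacing between checks configurable and lets the sketch run instantly in a test.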

2. Incident Resolution

Similarly, we've improved how we determine when an incident should be resolved:

• Multiple consecutive successful checks required before closing an incident

• Verification from different geographic regions

• A new confidence scoring system based on stability patterns

These changes prevent the "flapping" behavior where incidents are repeatedly opened and closed during unstable conditions.
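A minimal sketch of resolution rules along these lines is shown below. The class name, thresholds, and especially the confidence formula (the share of successful checks in a recent window) are assumptions made for illustration; they are not a description of the actual scoring system.

```python
from collections import deque
from typing import Deque, Tuple


class ResolutionTracker:
    """Decides whether an open incident looks stable enough to close."""

    def __init__(
        self,
        required_successes: int = 3,
        min_regions: int = 2,
        window: int = 5,
        min_confidence: float = 0.6,
    ) -> None:
        self.required_successes = required_successes
        self.min_regions = min_regions
        self.min_confidence = min_confidence
        # Only the most recent `window` results are kept.
        self.history: Deque[Tuple[str, bool]] = deque(maxlen=window)

    def record(self, region: str, ok: bool) -> None:
        self.history.append((region, ok))

    def confidence(self) -> float:
        """Hypothetical stability score: fraction of recent checks that passed."""
        if not self.history:
            return 0.0
        return sum(ok for _, ok in self.history) / len(self.history)

    def should_resolve(self) -> bool:
        recent = list(self.history)[-self.required_successes:]
        if len(recent) < self.required_successes:
            return False
        if not all(ok for _, ok in recent):
            return False  # the streak of successes was interrupted
        regions = {region for region, _ in recent}
        if len(regions) < self.min_regions:
            return False  # successes must come from different regions
        return self.confidence() >= self.min_confidence


tracker = ResolutionTracker()
for region, ok in [("us-east", False), ("us-east", True), ("eu-west", True)]:
    tracker.record(region, ok)
print(tracker.should_resolve())  # streak is too short, incident stays open
tracker.record("ap-south", True)
print(tracker.should_resolve())  # three successes from three regions
```

Because a single failed check resets the streak, an incident that is still flapping stays open rather than being closed and reopened.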

3. Intelligent Incident Statuses (Coming Soon)

Soon we will go beyond simple up or down statuses. We are working on a way to give our users more context about what is happening with their monitors during incidents. This could look like:

• Alongside showing a service as down, we can show if some regions have started to recover but we are not confident enough to say the incident is entirely resolved.

• Partial incidents, where we can show that a monitor is experiencing some issues but not total downtime. This could mean the monitor is down only in certain regions rather than globally, or that it is facing intermittent downtime that is not severe enough to consider it fully down.
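Since this feature is still in progress, the following is only a speculative sketch of how such richer statuses might be derived from per-region results. The `Status` names and the `classify` rules are hypothetical and may not match what eventually ships.

```python
from enum import Enum
from typing import Mapping


class Status(Enum):
    UP = "up"
    PARTIAL = "partial incident"
    RECOVERING = "down, some regions recovering"
    DOWN = "down"


def classify(region_ok: Mapping[str, bool], incident_open: bool) -> Status:
    """Map the latest per-region check results to a status.

    `region_ok` maps a region name to whether its latest check passed.
    `incident_open` says whether a confirmed incident is still unresolved.
    """
    if not region_ok:
        raise ValueError("at least one region result is required")
    healthy = sum(region_ok.values())
    if healthy == 0:
        return Status.DOWN
    if incident_open:
        # Some regions pass again, but the incident is not yet resolved.
        return Status.RECOVERING
    if healthy == len(region_ok):
        return Status.UP
    return Status.PARTIAL


latest = {"us-east": True, "eu-west": False, "ap-south": True}
print(classify(latest, incident_open=False).value)
print(classify(latest, incident_open=True).value)
```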

The Results: Clarity During Chaos

These improvements have fundamentally changed how our monitoring behaves during complex outages. Instead of contributing to the chaos with a flood of conflicting notifications, our system now provides clear, actionable intelligence about service health.

Our testing so far suggests that Monitodo is unlikely to face this issue again, though we would be lying if we said that's a guarantee. Despite the challenges that downtime of major internet infrastructure causes, it is a huge learning opportunity for us here at Monitodo, and we treat these incidents as chances to improve our service.

Beyond Technology: The Human Factor

This experience reinforced something we've always believed: monitoring isn't just about detecting technical failures; it's about communicating meaningful information to humans who need to make decisions. The best monitoring solution isn't necessarily the one that detects every microsecond of instability, but rather the one that helps teams understand what's happening and what requires their attention.

Looking Forward

As the internet grows increasingly complex, with layers of interdependent services, the challenge of meaningful monitoring grows with it. We're committed to continuing this evolution, developing monitoring approaches that provide clarity rather than noise.
