Inside Cloudflare’s Power Outage Ordeal: My Take on the Unprecedented Two-Day Service Challenge at the Main Data Center

Introduction

On Thursday, November 2nd, 2023, at 11:43 UTC, Cloudflare had an outage that left several of their services either completely offline or only partially operational. Overall, you couldn’t change the configuration of Cloudflare services, access the Cloudflare Dashboard or API, or view analytics or logs. For example, there were no issues with DNS resolution, but updating DNS records led to errors.

The post-mortem from CEO and Co-Founder Matthew Prince

Post Mortem on Cloudflare Control Plane and Analytics Outage
Beginning on Thursday, November 2, 2023 at 11:43 UTC Cloudflare’s control plane and analytics services experienced an outage. Here are the details
blog.cloudflare.com

Cloudflare’s CEO and Co-Founder, Matthew Prince, posted a post-mortem on their blog covering the major outage of November 2nd to 4th.

I’ve had involvement in multiple small and large outages in my career, spanning 500 users to over 1 million. Outages happen; you can’t cheat death or an outage. But you can plan and control what happens during an outage.

Cloudflare did more than most companies here; they planned and tested for such an outage and built in the necessary redundancies.
Once you’re humming along successfully for quite some time, it’s easy to forget that an outage is a keystroke away. It’s easy to forget that you need to be diligent in your decision-making and how an outage might affect what you’re building.

For a company that is super agile, constantly pushing out new products and improving their existing ones, how can you diligently ensure redundancy is the top priority without stifling innovation? With processes, audits and testing at a level that is ingrained into everything.

The post-mortem was excellent; there was transparency and a commitment to improve. There was also finger-pointing, followed by accountability. As a customer, partner and stockholder, I have some questions and feelings about what happened. But I also realize it could have been worse, much worse.
I’m glad this happened; failures are necessary and beneficial, but only if you learn from them. The message from Matthew Prince is that Cloudflare has learned from this failure; let’s see if they act. I’m sure there’ll be many blog posts about upcoming changes due to this outage.

I’m still a customer and will still recommend Cloudflare. Some folks will do the opposite, and competitors will take advantage of this. At the end of the day, you pick what works for you. There will always be bias; I tried not to be biased and to be empathetic to the situation.

Other Media Coverage
