Cloudflare outage on November 18, 2025 post mortem

1353 points | 777 comments | 17 hours ago
gucci-on-fleek

> This showed up to Internet users trying to access our customers' sites as an error page indicating a failure within Cloudflare's network.

As a visitor to random web pages, I definitely appreciated this—much better than their completely false “checking the security of your connection” message.

> The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions

Also appreciate the honesty here.

> On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures to deliver core network traffic. […]

> Core traffic was largely flowing as normal by 14:30. We worked over the next few hours to mitigate increased load on various parts of our network as traffic rushed back online. As of 17:06 all systems at Cloudflare were functioning as normal.

Why did this take so long to resolve? I read through the entire article, and I understand why the outage happened, but when most of the network goes down, why wasn't the first step to revert any recent configuration changes, even ones that seem unrelated to the outage? (Or did I just misread something and this was explained somewhere?)

Of course, the correct solution is always obvious in retrospect, and it's impressive that it only took 7 minutes between the start of the outage and the incident being investigated, but taking roughly a further three hours before core traffic was largely flowing again, and almost six hours total for everything to be back to normal, isn't great.

SerCe

As always, kudos for releasing a post mortem less than 24 hours after the outage; very few tech organisations are capable of doing this.

thatoneengineer

The unwrap: not great, but understandable. Better to silently run with a partial config while paging oncall on some other channel, but that's a lot of engineering for a case that apparently is supposed to be "can't happen".

The lack of canary: cause for concern, but I more or less believe Cloudflare when they say this is unavoidable given the use case. Good reason to be extra careful though, which in some ways they weren't.

The slowness to root cause: sheer bad luck, with the status page down and Azure's DDoS yesterday all over the news.

The broken SQL: this is the one that I'd be up in arms about if I worked for Cloudflare. For a system with the power to roll out config to ~all of prod at once while bypassing a lot of the usual change tracking, having this escape testing and review is a major miss.

ojosilva

This is the multi-million dollar .unwrap() story. In a critical path of infrastructure serving a significant chunk of the internet, calling .unwrap() on a Result means you're saying "this can never fail, and if it does, crash the thread immediately." The Rust compiler forced them to acknowledge this could fail (that's what Result is for), but they explicitly chose to panic instead of handling it gracefully. This is a textbook violation of "parse, don't validate".
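
To make the contrast concrete, here's a minimal sketch of the two shapes (names like `parse_features` and the exact limit handling are invented for illustration; the real code is only hinted at in the post):

    #[derive(Debug)]
    struct FeatureConfig { feature_count: usize }

    #[derive(Debug)]
    struct ConfigError(String);

    // Hypothetical parser: errors out if the file holds more features than the preallocated limit.
    fn parse_features(raw: &[String], limit: usize) -> Result<FeatureConfig, ConfigError> {
        if raw.len() > limit {
            return Err(ConfigError(format!("{} features exceeds limit {}", raw.len(), limit)));
        }
        Ok(FeatureConfig { feature_count: raw.len() })
    }

    fn main() {
        let oversized: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();

        // The panicking shape: "this can never fail" -- until it does, and the thread dies.
        // let config = parse_features(&oversized, 200).unwrap();

        // The graceful shape: report the error and keep serving with the last known good config.
        match parse_features(&oversized, 200) {
            Ok(config) => println!("loaded {} features", config.feature_count),
            Err(e) => eprintln!("{e:?}; keeping last known good configuration"),
        }
    }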

I know, this is "Monday morning quarterbacking", but that's what you get for an outage this big that had me tied up for half a day.

EvanAnderson

It reads a lot like the Crowdstrike SNAFU. Machine-generated configuration file b0rks-up the software that consumes it.

The "...was then propagated to all the machines that make up our network..." followed by "....caused the software to fail." screams for a phased rollout / rollback methodology. I get that "...it’s critical that it is rolled out frequently and rapidly as bad actors change their tactics quickly" but today's outage highlights that rapid deployment isn't all upside.

The remediation section doesn't give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy.

hnarn

> That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.

> The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

I'm no FAANG 10x engineer, and I appreciate things can be obvious in hindsight, but I'm somewhat surprised that engineering at the level of Cloudflare does not:

1. Push out files A/B to ensure the old file is not removed.

2. Handle the failure of loading the file (for whatever reason) by automatically reloading the old file instead and logging the error (roughly as sketched below).

This seems like pretty basic SRE stuff.
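
For what it's worth, option 2 above might look roughly like this; a minimal sketch with made-up file paths and a line-per-feature format, not Cloudflare's actual loader:

    use std::fs;

    // Try the freshly propagated file first; if it fails validation, fall back to the
    // previous known-good copy and log the rejection instead of crashing.
    fn load_features(new_path: &str, last_good_path: &str, limit: usize) -> Result<Vec<String>, String> {
        match read_and_validate(new_path, limit) {
            Ok(features) => Ok(features),
            Err(e) => {
                eprintln!("new feature file rejected ({e}); falling back to {last_good_path}");
                read_and_validate(last_good_path, limit)
            }
        }
    }

    fn read_and_validate(path: &str, limit: usize) -> Result<Vec<String>, String> {
        let raw = fs::read_to_string(path).map_err(|e| e.to_string())?;
        let features: Vec<String> = raw.lines().map(str::to_owned).collect();
        if features.len() > limit {
            return Err(format!("{} features exceeds limit {}", features.len(), limit));
        }
        Ok(features)
    }

    fn main() {
        // Paths are placeholders; the "B" copy would be whatever was last accepted.
        match load_features("features_new.txt", "features_last_good.txt", 200) {
            Ok(f) => println!("loaded {} features", f.len()),
            Err(e) => eprintln!("both copies unusable: {e}"),
        }
    }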

jjice

There's (obviously) a lot of discussion around the use of `unwrap` in production code. I feel like I'm watching comments speak past each other right now.

I'd agree that the use of `unwrap` could possibly make sense in a place where you do want the system to fail hard. There are lots of good reasons to make a system fail hard. I'd lean towards an `expect` here, but whatever.

That said, the function already returns a `Result` and we don't know what the calling code looks like. Maybe it does do an `unwrap` there too, or maybe there is a safe way for this to log and continue that we're not aware of because we don't have enough info.

Should a system as critical as the CF proxy fail hard? I don't know. I'd say yes if it was the kind of situation that could revert itself (like an incremental rollout), but this is such an interesting situation since it's a config being rolled out. Hindsight is 20/20 obviously, but it feels like there should've been better logging, deployment, rollback, and parsing/validation capabilities, no matter what the `unwrap`/`Result` option is.

Also, it seems like the initial ClickHouse changes could've been tested much better, but I'm sure the CF team realizes that.

On the bright side, this is a very solid write up so quickly after the outage. Much better than those times we get it two weeks later.

bri3d

Everyone is hating on unwrap, but to me the odd and more interesting part is that it took 3 hours to figure this out? Even with a DDoS red herring, shouldn’t there have been a crash log or telemetry anomaly correlated? Also, shouldn’t the next steps and resolution focus more on this aspect, since it’s a high leverage tool for identifying any outage caused by a panic rather than just preventing a recurrence of random weird edge case #9999999?

cowsandmilk

Blog post from less than a week ago on how Cloudflare avoids outages on configuration changes: https://blog.cloudflare.com/finding-the-grain-of-sand-in-a-h...

This has to sting a bit after that post.

otterley

> work has already begun on how we will harden them against failures like this in the future. In particular we are:

> Hardening ingestion of Cloudflare-generated configuration files in the same way we would for user-generated input

> Enabling more global kill switches for features

> Eliminating the ability for core dumps or other error reports to overwhelm system resources

> Reviewing failure modes for error conditions across all core proxy modules

Absent from this list are canary deployments and incremental or wave-based deployment of configuration files (which are often as dangerous as code changes) across fault isolation boundaries -- assuming Cloudflare has such boundaries at all. How are they going to contain the blast radius in the future?

This is something the industry was supposed to learn from the CrowdStrike incident last year, but it's clear that we still have a long way to go.

Also, enabling global anything (i.e., "enabling global kill switches for features") sounds like an incredibly risky idea. One can imagine a bug in a global switch that transforms disabling a feature into disabling an entire system.

lukan

"Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare. While it turned out to be a coincidence, it led some of the team diagnosing the issue to believe that an attacker may be targeting both our systems as well as our status page."

Unfortunately they do not share what caused the status page to go down as well. (Does this happen often? Otherwise it seems like a big coincidence.)

zf00002

As an IT person, I wonder what it's like to work for a company like this, where presumably IT stuff has priority. Unlike the companies I've worked for, where IT takes a backseat to everything until something goes wrong. The company I work for had a huge new office built, with the plan that it would be big enough for future growth, yet despite repeated attempts to reserve a larger space, our server room and infrastructure are actually smaller than in our old building, with no room to grow.

vsgherzi

Why does Cloudflare allow unwraps in their code? I would've assumed they'd have clippy lints stopping that sort of thing. Why not just match with { Ok(value) => {}, Err(error) => {} }? The function already returns a Result.

At the bare minimum they could've used an expect("this should never happen; if it does, the database schema is incorrect").

The whole point of errors as values is preventing this kind of thing... It wouldn't have stopped the outage, but it would've made it easy to diagnose.

If anyone at Cloudflare is here, please let me into that codebase :)
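
The lints for this do exist; a minimal sketch of turning them on crate-wide (the surrounding code is made up, and newer toolchains can configure the same thing in Cargo.toml under `[lints.clippy]`):

    // At the crate root (main.rs or lib.rs): make unwrap()/expect() a hard clippy error.
    #![deny(clippy::unwrap_used)]
    #![deny(clippy::expect_used)]

    fn parse_limit(raw: &str) -> Result<usize, std::num::ParseIntError> {
        raw.trim().parse()
    }

    fn main() -> Result<(), std::num::ParseIntError> {
        // let limit = parse_limit("200").unwrap(); // rejected by clippy under unwrap_used
        let limit = parse_limit("200")?; // propagate the error instead
        println!("limit = {limit}");
        Ok(())
    }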

RagingCactus

Lots of people here are (perhaps rightfully) pointing to the unwrap() call being an issue. That might be true, but to me the fact that a reasonably "clean" panic at a defined line of code was not quickly picked up in any error monitoring system sounds just as important to investigate.

Assuming something similar to Sentry would be in use, it should clearly pick up the many process crashes that start occurring right as the downtime starts. And the well-defined clean crashes should in theory also stand out against all the random errors that start occurring all over the system as it begins to go down, precisely because it's always failing at the exact same point.
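
One cheap way to make those panics stand out is a process-wide panic hook that tags every panic with its source location before handing it to whatever error tracker is in use. A rough sketch (the reporter here is just an eprintln! stand-in, not any particular SDK):

    use std::panic;

    fn main() {
        // Forward every panic (message + location) to the error-tracking pipeline so
        // identical crash sites aggregate and stand out immediately.
        let default_hook = panic::take_hook();
        panic::set_hook(Box::new(move |info| {
            let location = info
                .location()
                .map(|l| format!("{}:{}", l.file(), l.line()))
                .unwrap_or_else(|| "unknown location".to_string());
            // Stand-in for a real reporter (Sentry, structured logs, a metrics counter, ...).
            eprintln!("PANIC at {location}: {info}");
            default_hook(info);
        }));

        // Simulate the failure mode described in the post-mortem.
        let result: Result<(), &str> = Err("feature file exceeds size limit");
        let _ = result.unwrap();
    }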

HL33tibCe7

An unwrap like that in production code on the critical path is very surprising to me.

I haven’t worked in Rust codebases, but I have never worked in a Go codebase where a `panic` in such a location would make it through code review.

Is this normal in Rust?

dzonga

> thread fl2_worker_thread panicked: called Result::unwrap() on an Err value

I don't use Rust, but a lot of Rust people say if it compiles it runs.

Well, Rust won't save you from the usual programming mistakes. Not blaming anyone at Cloudflare here. I love Cloudflare and the awesome tools they put out.

end of day - let's pick languages | tech because of what we love to do. if you love Rust - pick it all day. I actually wanna try it for industrial robot stuff or small controllers etc.

there's no bad language - just occasional hiccups from us users who use those tools.

dilyevsky

Long time ago Google had a very similar incident where ddos protection system ingested a bad config and took everything down. Except it was auto resolved in like four minutes by an automatic rollback system before oncall was even able to do anything. Perhaps Cloudflare should invest in a system like that
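
Purely as a sketch of the idea, not of how either company's tooling actually works (every function here is a stand-in): push to a small slice first, watch an error-rate signal, and roll back automatically the moment it trips.

    struct Config { version: u64 }

    fn deploy(cfg: &Config, fraction: f64) { println!("deployed v{} to {:.0}% of fleet", cfg.version, fraction * 100.0); }
    fn error_rate(_fraction: f64) -> f64 { 0.42 } // pretend the canary is unhealthy
    fn rollback(previous: &Config) { println!("rolled back to v{}", previous.version); }

    fn deploy_with_auto_rollback(new: Config, previous: Config, threshold: f64) {
        for fraction in [0.01, 0.10, 0.50, 1.00] {
            deploy(&new, fraction);
            if error_rate(fraction) > threshold {
                rollback(&previous);
                return; // no human in the loop needed for the fast path
            }
        }
        println!("v{} fully rolled out", new.version);
    }

    fn main() {
        deploy_with_auto_rollback(Config { version: 2 }, Config { version: 1 }, 0.05);
    }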

keypusher

The most surprising thing to me here is that it took 3 hours to root cause, and points to a glaring hole in the platform observability. Even taking into account the fact that the service was failing intermittently at first, it still took 1.5 hours after it started failing consistently to root cause. But the service was crashing on startup. If a core service is throwing a panic at startup like that, it should be raising alerts or at least easily findable via log aggregation. It seems like maybe there was some significant time lost in assuming it was an attack, but it also seems strange to me that nobody was asking "what just changed?", which is usually the first question I ask during an incident.

jdlyga

We shouldn't be having critical internet-wide outages on a monthly basis. Something is systematically wrong with the way we're architecting our systems.

trengrj

Classic combination of errors:

Having the feature table pivoted (with 200 feature1, feature2, etc. columns) meant they had to do meta queries against system.columns to get all the feature columns, which made the query sensitive to permissioning changes (especially duplicate databases).

A CrowdStrike-style config update that affects all nodes but obviously isn't tested in any QA or staged rollout beforehand (the application panicking straight away on this new file basically proves this).

Finally, an error with bot management config files should probably disable bot management rather than crash the core proxy.

I'm interested why they even decided to name ClickHouse, as this error could have been caused by any other database. I can see, though, that the replicas updating and causing flip-flopping results would have been really frustrating for incident responders.

ademarre

I integrated Turnstile with a fail-open strategy that proved itself today. Basically, if the Turnstile JS fails to load in the browser (or in a few specific frontend error conditions), we allow the user to submit the web form with a dummy challenge token. On the backend, we process the dummy token like normal, and if there is an error or timeout checking Turnstile's siteverify endpoint, we fail open.

Of course, some users were still blocked, because the Turnstile JS failed to load in their browser but the subsequent siteverify check succeeded on the backend. But overall the fail-open implementation lessened impact to our customers nonetheless.

Fail-open with Turnstile works for us because we have other bot mitigations that are sufficient to fall back on in the event of a Cloudflare outage.
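
The backend half of that decision is basically a three-way branch; a rough sketch (types and names invented here, not the actual Turnstile siteverify client):

    #[derive(Debug)]
    enum Verdict { Allow, Block }

    #[allow(dead_code)]
    #[derive(Debug)]
    enum SiteverifyError { Timeout, Http(u16), Malformed }

    // Treat a siteverify error or timeout as "allow", because other bot mitigations
    // are in place to fall back on.
    fn decide(siteverify_result: Result<bool, SiteverifyError>) -> Verdict {
        match siteverify_result {
            Ok(true) => Verdict::Allow,   // token checked out
            Ok(false) => Verdict::Block,  // Turnstile says this is a bot
            Err(e) => {
                eprintln!("siteverify unavailable ({e:?}); failing open");
                Verdict::Allow            // outage mode: rely on the fallback mitigations
            }
        }
    }

    fn main() {
        println!("{:?}", decide(Ok(true)));
        println!("{:?}", decide(Err(SiteverifyError::Timeout)));
    }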

ed_mercer

Wow. 26M 5xx HTTP errors per second over a span of roughly two hours. That's roughly 187 billion HTTP errors that interrupted people (and systems)!

yoyohello13

People really like to hate on Rust for some reason. This wasn't a Rust problem; no language would have saved them from this kind of issue. In fact, the type system forced them to acknowledge that this was a possible failure point.

I get it, don't pick languages just because they are trendy, but if any company's use case is a perfect fit for Rust, it's Cloudflare.

kqr

One of the remediations listed is

> Eliminating the ability for core dumps or other error reports to overwhelm system resources

but this is not mentioned at all in the timeline above. My best guess would be that the process got stuck in a tight restart loop and filled available disk space with logs, but I'm happy to hear other guesses from people more familiar with Rust.

testemailfordg2

"Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero."

This simply means that the error handling in the new FL2 is non-existent, and not on par with (or logically equivalent to) what FL does.

I hope it was not because of AI-driven efficiency gains.

pdimitar

While I heavily frown upon using `unwrap` and `expect` in Rust code and make sure to have Clippy tell me about every single usage of them, I also understand that without them Rust might have been seen as an academic curiosity language.

They are escape hatches. Without those your language would never take off.

But here's the thing. Escape hatches are like emergency exits. They are not to be used by your team to go to lunch in a nearby restaurant.

---

Cloudflare should likely invest in better linting and CI/CD alerts. Not to mention isolated testing i.e. deploy this change only to a small subset and monitor, and only then do a wider deployment.

Hindsight is 20/20 and we can all be smartasses after the fact of course. But I am really surprised because lately I am only using Rust for hobby projects and even I know I should not use `unwrap` and `expect` beyond the first iteration phases.

---

I have advocated for this before, but IMO Rust at this point would benefit greatly from disallowing those panicking APIs by default in release mode. Though I understand why they don't want to do it -- likely millions of CI/CD pipelines would break overnight. But in the interim, maybe a flag we can put in our `Cargo.toml` that enables such a stricter mode? Or have that flag just remove all the panicky APIs _at compile time_, though I believe this might be a Gargantuan effort and is likely never happening (sadly).

In any case, I would expect many other failures from Cloudflare but not _this_ one in particular.

drc500free

Makes me wonder which team is responsible for that feature generating query, and if they follow full engineering level QA. It might be deferred to an MLE team that is better than the data scientists but less rigorous than software needs to be.

habibur

    On 18 November 2025 at 11:20 UTC (all times in this blog are UTC), Cloudflare's network began experiencing significant failures
    As of 17:06 all systems at Cloudflare were functioning as normal
6 hours / 5 years gives ~99.98% uptime.
throw7

Is this true? From that core proxy diagram, I didn't realize Cloudflare sees the full unencrypted traffic between you and the server.

If that's true, is there a way to tell (easily) whether a site is using Cloudflare or not?

nwellinghoff

The real takeaway is that so much functionality depends on a few players. This is a fundamental flaw in design that is getting worse by the year as the winner-take-all winners keep winning. Not saying they didn't earn their wins. But the fact remains: the system is not robust. Then again, so what. It went down for a while. Maybe we shouldn't depend on the internet being "up" all the time.

arkanovicz

Interesting technical insight, but I would be curious to hear firsthand accounts from the teams on the ground, particularly how the engineers felt the mounting pressure: frantically refreshing their dashboards, searching for a phantom DDoS, scrolling through code updates...

aetherspawn

Cloudflare Access is still experiencing weird issues for us (it’s asking users to SSO login to our public website even though our zone rules - set on a completely different zone - haven’t changed).

I don't think the infrastructure has recovered as fully as they think yet…

Chihuahua0633

> The first automated test detected the issue at 11:31 and manual investigation started at 11:32. The incident call was created at 11:35.

I'm impressed they were able to corral people this quickly.

tristan-morris

Why call .unwrap() in a function which returns Result<_,_>?

For something so critical, why aren't you using lints to identify and ideally deny panic-inducing code? This is one of the biggest strengths of using Rust in the first place for this problem domain.

MagicMoonlight

Having a system which automatically deploys configuration files across a million servers every 5 minutes without testing it seems stupid to me.

Barry-Perkins

The Cloudflare outage on November 18, 2025 highlights how critical internet infrastructure dependencies can impact services globally. The post-mortem provides transparency on root causes and recovery, offering valuable lessons in resilience and incident management.

jokoon

I don't understand what Cloudflare's business is.

They just sell proxies, to whoever.

Why are they the only company doing DDoS protection?

I just don't get it.

keepamovin

That's interesting: "That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network." It's like the issues with HOSTS.TXT needing to be copied among the network of the early internet to allow routing (taking days to download etc.) and DNS having to be created to make that propagation less unwieldy.

spprashant

A lot of outages of late seem to be related to automated config management.

Companies seem to place a lot of trust in configs being pushed automatically, without human review, into running systems. Considering how important these configs are, shouldn't they perhaps first be deployed to a staging/isolated network for a monitoring window before being pushed to production systems?

Not trying to pontificate here, these systems are more complicated than anything I have maintained. Just trying to think of best practices perhaps everyone can adopt.

chaos_emergent

Just a moment to reflect on how much freaking leverage computers give us today - a single permission change took down half the internet. Truly crazy times.

abigailphoebe

kudos to getting this blog post out so fast, it’s well written and is appreciated.

i’m a little confused on how this was initially confused for an attack though?

is there no internal visibility into where 5xx’s are being thrown? i’m surprised there isn’t some kind of "this request terminated at the <bot checking logic>" error mapping that could have initially pointed you guys towards that over an attack.

also a bit taken aback that .unwrap()’s are ever allowed within such an important context.

would appreciate some insight!

ksajadi

May I just say that Matthew Prince is the CEO of Cloudflare and a lawyer by training (and a very nice guy overall). The quality of this postmortem is great but the fact that it is from him makes one respect the company even more.

130R

If the software has a limit on the size of the feature file, then the process that propagates the file should probably validate the size before propagating it.
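
Exactly; a sketch of that producer-side guard, with made-up names and the 200-feature limit quoted elsewhere in the thread:

    const CONSUMER_FEATURE_LIMIT: usize = 200;

    // The publisher refuses to ship a feature file that consumers are known to reject.
    fn publish_feature_file(features: &[String]) -> Result<(), String> {
        if features.len() > CONSUMER_FEATURE_LIMIT {
            return Err(format!(
                "refusing to propagate: {} features > consumer limit {}",
                features.len(),
                CONSUMER_FEATURE_LIMIT
            ));
        }
        // ... write the file and push it to the edge here ...
        Ok(())
    }

    fn main() {
        let doubled: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();
        if let Err(e) = publish_feature_file(&doubled) {
            eprintln!("{e}"); // alert instead of shipping a file that will crash the consumers
        }
    }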

BrtByte

This incident feels like a strong argument for stricter guardrails around internal config propagation

cmilton

How many changes to production systems does Cloudflare make throughout a day? Are they a part of any change management process? That would be the first place I would check after a random outage, recent changes.

l___l

> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.

What could have prevented this failure?

Cloudflare's software could have included a check that refused to generate the feature file if its size was higher than the limit.

A test case could have caught this.
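
A test of that shape could be very small; the parser below is hypothetical, the point is asserting that an oversized file yields an Err rather than a panic:

    fn parse_feature_file(contents: &str, limit: usize) -> Result<Vec<String>, String> {
        let features: Vec<String> = contents.lines().map(str::to_owned).collect();
        if features.len() > limit {
            return Err(format!("{} features exceeds limit {}", features.len(), limit));
        }
        Ok(features)
    }

    #[test]
    fn oversized_feature_file_is_rejected_not_fatal() {
        // Simulate the doubled file: 400 features against a limit of 200.
        let doubled = (0..400).map(|i| format!("feature_{i}")).collect::<Vec<_>>().join("\n");
        assert!(parse_feature_file(&doubled, 200).is_err());
    }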

gkoz

Given this was triggered by an old school configuration change across multiple servers, there's too little discussion of that particular process.

It sounds like the change could've been rolled out more slowly, halted when the incident started and perhaps rolled back just in case.

zhisme

Thank you for being honest; we all must learn from these mistakes.

baalimago

Interesting. Although the principle of least privilege is great, it should not be applied as a feature to filter data.

cvhc

I don't get why that SQL query was even used in the first place. It seems it fetches feature names at runtime instead of using a static hardcoded schema. Considering this decides the schema of a global config, I don't think the dynamicity is a good idea.

kylegalbraith

The outage sucked for everyone. The root cause also feels like something they could have caught much earlier in a canary rollout from my reading of this.

All that said, to have an outage report turned around practically the same day, and one that is this detailed, is quite impressive. Here's hoping they follow through on the changes from this learning, and we don't see this exact failure mode again.

avereveard

Question: customers having issues also couldn't switch their DNS to bypass the service. Why is the control plane updated along with the data plane here? It seems a lot of users could have preserved business continuity if they could have changed their DNS entries temporarily.

__alexs

Is dual sourcing CDNs feasible these days? Seems like having the capability to swap between CDN providers is good both from a negotiating perspective and a resiliency one.

darksideofthem

Speaking of resiliency, the entire Bot Management module doesn't seem to be a critical part of the system. So, for example, what happens if that module goes down for an hour? The other parts of the system should work. I would rank every module and its role in the system, and design it in a way that when a non-critical module fails, the other parts can still function.

igornadj

Any feature failing should still allow the traffic to continue. This should be the first bullet in the future actions list.

leonaves

Why have a limit on the file size if the thing that happens when you hit the limit is the entire network goes down? Surely not having a limit can't be worse?

vasuadari

Wondering why they didn’t disable the bot management temporarily to recover. Websites could have survived temporarily without it compared to the outage itself.

zeroq

But Rust was supposed to cure cancer and solve world hunger. Is this the end of the hello world but in Rust saga?

baalimago

Git blame disabled on the line which crashed it - Cowards!

JamesJGoodwin

>Currently that limit is set to 200, well above our current use of ~60 features. Again, the limit exists because for performance reasons we preallocate memory for the features.

So they basically hardcoded something, didn't bother to cover the overflow case with unit tests, didn't have basic error handling that would fall back and send logs/alerts to their internal monitoring system, and this is why half of the internet went down?

wildmXranat

Hold up - when I used C or a similar language for accessing a database and wanted to clamp down on memory usage to deterministically control how much I allocated, I would explicitly limit the number of rows in the query.

There never was an unbound "select all rows from some table" without a "fetch first N rows only" or "limit N".

If you knew that this design is rigid, why not leverage the query to actually do it?

What am I missing ?

elAhmo

Timely post mortem. Sucks that this happened, but at least they are quite transparent and detailed in the writeup.

niedbalski

Sure this has been said already, but the issue here is not the code; it's the ability to canary and roll back quickly from any arbitrary (config) change.

hbarka

ClickHouse db was mentioned. Does this incident raise any critiques about it?

markhandoff

Dear Matthew Prince, don't you think we (the ones affected by your staff's mistake) should get some sort of compensation??? Yours truly, a Cloudflare client who lost money during the November 18th outage.

nromiun

Unbelievable. I guess it's time to grep for every .unwrap in our code.

agonux

Time to rewrite with golang, explicit error handling ;-)

zyngaro

Catastrophic failure for failing to read a file bigger than expected? Wow. This is really embarrassing.

arjie

Great post-mortem. Very clear. Surprised that num(panicking threads) didn't show up somewhere in telemetry.

kjgkjhfkjf

Seems like a substantial fraction of the web was brought down because of a coding error that should have been caught in CI by a linter.

These folks weren't operating for charity. They were highly paid so-called professionals.

Who will be held accountable for this?

cvshane

Would be nice if their Turnstile could be turned off on their login page when something like this happens, so we can attempt to route traffic away from Cloudflare during the outage. Or at least have a simple app from which this can be modified.

laurentiurad

Reason for the failure: switched to Chad IDE to ship new features.

1970-01-01

I would have been a bit cheeky and opened with 'It wasn't DNS.'

sanjitb

cloudflare:

> Throwing us off and making us believe this might have been an attack was another apparent symptom we observed: Cloudflare’s status page went down. The status page is hosted completely off Cloudflare’s infrastructure with no dependencies on Cloudflare.

also cloudflare:

> The Cloudflare Dashboard was also impacted due to both Workers KV being used internally and Cloudflare Turnstile being deployed as part of our login flow.

BolexNOLA

I’ll be honest, I only understand about 30% of what is being said in this thread and that is probably generous. But it is very interesting seeing so many people respond to each other “it’s so simple! what went wrong was…” as they all disagree on what exactly went wrong.

sema4hacker

If you deploy a change to your system, and things start to go wrong that same day, the prime suspect (no matter how unlikely it might seem) should be the change you made.

sigmar

Wow. What a post mortem. Rather than Monday-morning quarterbacking how many ways this could have been prevented, I'd love to hear people sound off on things that unexpectedly broke. I, for one, did not realize logging in to Porkbun to edit DNS settings would become impossible with a Cloudflare meltdown.

assbuttbuttass

My website was down too, because a tree fell on my power line

slyall

Ironically just now I got a Cloudflare "Error code 524" page because blog.cloudflare.com was down

themark

“…and the fluctuation stabilized in the failing state.”

Sounds like the ops team had one hell of a day.

CSMastermind

I'm honestly curious what culturally is going on inside Cloudflare given they've had a few outages this year.

alhirzel

> I worry this is the big botnet flexing.

Even worse - the small botnet that controls everything.

back_to_basics

While it's certainly worthwhile to discuss the Technical and Procedural elements that contributed to this Service Outage, the far more important (and mutually exclusive) aspect to discuss should be:

Why have we built / permitted the building of / Subscribed to such a Failure-intolerant "Network"?

Adam2025

Cloudflare’s write-up is clear and to the point. A small change spread wider than expected, and they explained where the process failed. It’s a good reminder that reliability depends on strong workflows as much as infrastructure.

0x001D

Configuration can be validated. https://cuelang.org

nanankcornering

Matt, looking forward to regaining Elon's and his team's trust so they use CF again.

mmaunder

tl;dr A permissions change in a ClickHouse database caused a query to return duplicate rows for a “feature file” used by Cloudflare's Bot Management system, which doubled the file size. That oversized file was propagated to their core proxy machines, triggered an unhandled error in the proxy's bot module (it exceeded its pre-allocated limit), and as a result the network started returning 5xx errors. The issue wasn't a cyber-attack; it was a configuration/automation failure.

jeffrallen

This is an excellent lesson learned: Harden loading of internally generated config files as though they were untrusted content.

Gonna use that one at $WORK.

dev_l1x_be

Was it DNS this time?

nullbyte808

I thought it was an internal mess-up. I thought an employee screwed a file up. Old methods are sometimes better than new. AI fails us again!

aspbee555

unwraps are so very easy to use, and they have bitten me so many times: you can go ages without ever hitting a problem, and then suddenly something crashes from an unwrap that was almost always fine.

zzzeek

> Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system.

And here is the query they used ** (OK, so it's not exactly):

     SELECT * from feature JOIN permissions on feature.feature_type_id = permissions.feature_type_id

Someone added a new row to permissions and the JOIN started returning two dupe feature rows for each distinct feature.

** "here is the query" is used for dramatic effect. I have no knowledge of what kind of database they are even using, much less the queries (but I do have an idea).

more edits: OK, apparently it's described later in the post as a query against ClickHouse's table metadata, and because users were granted access to an additional database that was actually the backing store of the one they normally worked with, some row-level-security type of thing doubled up the rows. Not sure why querying system.columns is part of a production-level query though; seems overly dynamic.

chatmasta

Wow, crazy disproportionate drop in the stock price… good buying opportunity for $NET.

anal_reactor

Honestly... everyone shat themselves because the internet didn't work, but next week this outage will be forgotten by 99% of the population. I was doing something on my PC when I saw clear information that Cloudflare was down, so I decided to just go take a nap, then read a book, then go for a walk. Once I was done, the internet was working again. Panic was not necessary on my side.

What I'm trying to say is that things would be much better if everyone took a chill pill and accepted the possibility that in rare instances, the internet doesn't work and that's fine. You don't need to keep scrolling TikTok 24/7.

> but my use case is especially important

Take a chill pill. Probably it isn't.

Martcpp

deny(clippy::unwrap_used)

nawgz

> a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system ... to keep [that] system up to date with ever changing threats

> The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail

A configuration error can cause internet-scale outages. What an era we live in

Edit: also, after finishing my reading, I have to express some surprise that this type of error wasn't caught in a staging environment. If the entire error is that "during migration of ClickHouse nodes, the migration -> query -> configuration file pipeline caused configuration files to become illegally large", it seems intuitive to me that doing this same migration in staging would have identified this exact error, no?

I'm not big on distributed systems by any means, so maybe I'm overly naive, but frankly posting a faulty Rust code snippet that was unwrapping an error value without checking for the error didn't inspire confidence for me!

lofaszvanitt

It is staggering to see that even large companies like CF have zero monitoring that would tell them what happened at t=0.

xlii

Cloudflare rewrites Rust services to <next-cool-language> /joke

...

(I'd pick Haskell, cause I'm having fun with it recently :P)

nurettin

I can never get used to the error surfacing at the call site rather than within the function where the early return of the Err happened. It is not "much cleaner"; you have no idea which line and file caused it from the call site. By default, returning an Err should have a way of setting a marker which can then be used to map back to the line() and file(). 10+ years and still no ergonomics.
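
For what it's worth, you can get most of that today with #[track_caller] and std::panic::Location (anyhow's backtraces are the batteries-included version of the same idea). A minimal sketch, all names invented:

    use std::panic::Location;

    #[derive(Debug)]
    struct LocatedError {
        msg: String,
        at: &'static Location<'static>,
    }

    impl LocatedError {
        #[track_caller]
        fn new(msg: impl Into<String>) -> Self {
            // Location::caller() records where new() was invoked, i.e. where the Err is built.
            LocatedError { msg: msg.into(), at: Location::caller() }
        }
    }

    fn load_config(oversized: bool) -> Result<(), LocatedError> {
        if oversized {
            // The error remembers this file and line, not the caller's unwrap().
            return Err(LocatedError::new("feature file exceeds size limit"));
        }
        Ok(())
    }

    fn main() {
        if let Err(e) = load_config(true) {
            eprintln!("{} (raised at {}:{})", e.msg, e.at.file(), e.at.line());
        }
    }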

AtNightWeCode

So they made a newbie mistake in SQL that would not even pass an AI review. They did not verify the change in a test environment. And I guess the logs are so full of errors it is hard to pinpoint which matters. Yikes.

lapcat

It's unbelievable that the end of this postmortem is an advertisement for Cloudflare.

The last thing we need here is for more of the internet to sign up for Cloudflare.

ulfw

The internet hasn't been the internet in years. It was originally built to withstand wars. The whole idea of our IP-based internet was to reroute packets should networks go down. Decentralisation was the mantra, and how it differed from early centralised systems such as AOL et al.

This is all gone. The internet is a centralised system in the hands of just a few companies. If AWS goes down, half the internet does. If Azure, Google Cloud, Oracle Cloud, Tencent Cloud or Alibaba Cloud goes down, a large part of the internet does.

Yesterday with Cloudflare down half the sites I tried gave me nothing but errors.

The internet is dead.


robofanatic

hope no one was fired


makach

Excellent write up. Cybersecurity professionals, read the story and learn. It's a textbook lesson in post-mortem incident analysis - an MVP for what is expected from us all in a similar situation.

Reputationally this is extremely embarrassing for Cloudflare, but IMO they seem to be getting their feet back on the ground. I was surprised to see not just one, but two apologies to the internet. This just cements how professional and dedicated the Cloudflare team is to ensuring a stable, resilient internet, and how embarrassed they must have been.

A reputational hit for sure, but outcome is lessons learned and hopefully stronger resilience.

xyst

A fucking unhandled exception brought down a majority of the internet? Why do we continue to let these clowns run a large portion of the internet?

Big tech is a fucking joke.

uecker

So, an unhandled error condition after a configuration update, similar to CrowdStrike - if only they had used a programming language where this can't happen thanks to its superior type system, such as Rust. Oh wait.

weihz5138

Glad it's fixed, keep going!

jijji

This is where change management really shines: in a change-management environment this would have been prevented by a backout procedure, and it would never have been rolled out to production before going through QA, with peer review happening before that... I don't know if they lack change management, but it's definitely something to think about.

binarymax

28M 500 errors/sec for several hours from a single provider. Must be a new record.

No other time in history has one single company been responsible for so much commerce and traffic. I wonder what some outage analogs to the pre-internet ages would be.


moralestapia

No publicity is bad publicity.

Best post mortem I've read in a while, this thing will be studied for years.

A bit ironic that their internal FL2 tool is supposed to make Cloudflare "faster and more secure" but brought a lot of things down. And yeah, as others have already pointed out, that's a very unsafe use of Rust; it should've never made it to production.

homeonthemtn

Another long day...

awesome_dude

But but but muh rust makes EVERYTHING safer!!!!

My dude, everything is a footgun if you hold it wrong enough

rvz

Great write up.

This is the first significant outage that has involved Rust code, and as you can see, .unwrap() is known to carry the risk of a panic and should never be used in production code.

sachahjkl

me af when there's a postmortem rubbing hands, impish smile on my face

issafram

I give them a pass on lots of things, but this is inexcusable

snoppy45

I think you should give me a credit for all the income I lost due to this outage. Who authorized a change to the core infrastructure during the period of the year when your customers make the most income? Seriously, this is a management failure at the highest levels of decision-making. We don't make any changes to our server infrastructure/stack during the busiest time of the year, and neither should you. If there were an alternative to Cloudflare, I'd leave your service and move my systems elsewhere.

wileydragonfly

Did some $300k chief of IT blame it all on some overworked secretary clicking a link in an email they should have run through a filter? Because that’s the MO.

rawgabbit

> The change explained above resulted in all users accessing accurate metadata about tables they have access to. Unfortunately, there were assumptions made in the past, that the list of columns returned by a query like this would only include the “default” database:

    SELECT
        name,
        type
    FROM system.columns
    WHERE
        table = 'http_requests_features'
    ORDER BY name;

> Note how the query does not filter for the database name. With us gradually rolling out the explicit grants to users of a given ClickHouse cluster, after the change at 11:05 the query above started returning “duplicates” of columns because those were for underlying tables stored in the r0 database.
0xbadcafebee

So, to recap:

  - Their database permissions changed unexpectedly (??)
  - This caused a 'feature file' to be changed in an unusual way (?!)
     - Their SQL query made assumptions about the database; their permissions change thus resulted in queries getting additional results, permitted by the query
  - Changes were propagated to production servers which then crashed those servers (meaning they weren't tested correctly)
     - They hit an internal application memory limit and that just... crashed the app
  - The crashing did not result in an automatic backout of the change, meaning their deployments aren't blue/green or progressive
  - After fixing it, they were vulnerable to a thundering herd problem
  - Customers who were not using bot rules were not affected; Cloudflare's bot-scorer generated a constant bot score of 0, meaning all traffic was scored as bots
In terms of preventing this from a software engineering perspective, they made assumptions about how their database queries work (and didn't validate the results), and they ignored their own application limits and didn't program in either a test for whether an input would hit a limit, or some kind of alarm to notify the engineers of the source of the problem.

From an operations perspective, it would appear they didn't test this on a non-production system mimicking production; they then didn't have a progressive deployment; and they didn't have a circuit breaker to stop the deployment or roll back when a newly deployed app started crashing.

jamesblonde

Cloudflare tried to build their own feature store, and got a grade F.

I wrote a book on feature stores, published by O'Reilly. The bad query they wrote in ClickHouse could also have been broken by another error: duplicate rows in the materialized feature data. For example, Hopsworks prevents duplicate rows by building on primary-key uniqueness enforcement in Apache Hudi. In contrast, Delta Lake and Iceberg do not enforce primary key constraints, and neither does ClickHouse. So they could have the same bug again due to a bug in feature ingestion - and given they hacked together their feature store, it is not beyond the bounds of possibility.

Reference: https://www.oreilly.com/library/view/building-machine-learni...