Between the engineering staff and the warehouse workers, I wonder how long it will be until they've fired everyone who would ever have been willing to work there.
Even with candidate pools of hundreds of thousands of H-1B engineers and tens of millions of illegal-immigrant warehouse workers, there still comes a point where a company this big, firing this many people this quickly, exhausts all its options.
It reminds me of the Robot Chicken sketch where Imperial officers aboard the Death Star all pretend to be force-choked to death by Darth Vader so they can avoid getting killed by lightsaber, then come back under different names in different jobs. It's worse for Amazon, though: nobody wants to come back.
https://www.youtube.com/watch?v=fFihTRIxCkg
Look, not to defend anything Amazon is doing, but this causal chain seems rather pareidolic and under-evidenced. You could spin some kind of crazy narrative about any major outage based on policy changes that happened just before. But this isn't nearly the first AWS outage, and most of them happened before the recent RTO changes. At best, the claim needs more evidence.
mlhpdx
This is the time to accept that the path forward is keeping people and giving them the best tools you possibly can to do their work. That is, what has been true for decades remains true.
Yes, development tools are better every day. Yes, you can downsize. No, it won't be felt immediately. Yes, it mortgages the future, and at a painfully high interest rate.
Suspending disbelief won’t make downsizing work better.
crmd
I wish to understand the virtue of Amazon culture.
It seems that at L6 and below, workers are Taylorism-style fungible widgets driven to convert salary into work product, guided to produce the most output for the longest time before mentally breaking down and being swiftly replaced, while L7 and above is so incredibly political that keeping the snakes and vultures from eating your team is a full-time job at every level of senior management.
It never made sense to me how such a ruthless and inhumane culture is sustainable in the long run.
I would love to hear positive counter perspectives from Amazonians because the anecdotes from my L6-L10 friends describe what sounds like an inhumane hell on earth.
shelled
It was Diwali vacation in India. It looks like the managers were not able to force everyone to walk around with their laptops and pagers hanging from their necks and waists, respectively, which they normally do.
If there's one thing I have learned from my Amazon mates, it's that they never have true time off. Hills, beaches, a marriage in the family: no exceptions. It's so pervasive that I can't really imagine it being voluntary, and my friends' answers on this topic have never been clear.
nijave
It was certainly suspicious that actual progress on the outage seemed to start right around U.S. west coast start of day. Updates before that were largely generic "we're monitoring and mitigating" with nothing of substance.
pinkmuffinere
> one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow
Is 75 minutes really considered that long of a time? I don't do my day-job in webdev, so maybe I'm just naive. But being able to diagnose the single service endpoint in 75 minutes seems pretty good to me. When I worked on firmware we frequently spent _weeks_ trying to diagnose what part of the firmware was broken.
whatever1
Tech will learn like manufacturing folks did that experience is not fungible. You can try to replace someone, but the new guy also needs to accumulate the scars from the system for years before taking over.
You cannot just keep abstracting and chopping systems to smaller and smaller subsystems to make them easy to digest.
At some point someone needs to know how these coordinate and behave under disturbances. At some point someone needs to know at a low level what the hell is going on.
pdonis
One thing I love about El Reg is that they never shrink from calling a spade a spade.
tcgv
According to the article, the issue was caused by:
> "engineers had identified DNS resolution of the DynamoDB API endpoint for US-EAST-1 as the likely root cause"
Interestingly, we found matching errors in our own logs:
> System.Net.WebException
> The remote name could not be resolved: 'dynamodb.us-east-1.amazonaws.com'
Occurrences were recorded on:
- 2025-04-30
- 2025-05-29
- 2025-06-17
- Yesterday
We had logged this as a low-priority bug since previous incidents only affected our AWS testing environments (and never our production env which is on Azure). At the time, we assumed it was some CI/CD glitch.
It now seems that the underlying cause was this DNS issue all along, and only yesterday did it start impacting systems outside of AWS.
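For what it's worth, a failure pattern like this is cheap to surface early. A minimal sketch of a DNS-resolution canary (the endpoint is from our logs above; the interval, and the idea of running it this way at all, are just illustrative):

    # Hypothetical DNS-resolution canary: log every failure with a timestamp
    # instead of letting intermittent errors hide in low-priority bug reports.
    import logging
    import socket
    import time

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    ENDPOINT = "dynamodb.us-east-1.amazonaws.com"
    INTERVAL_SECONDS = 60  # assumed polling interval

    while True:
        try:
            # Exercises the same resolver path an HTTP client would use.
            infos = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
            logging.info("resolved %s -> %s", ENDPOINT,
                         sorted({info[4][0] for info in infos}))
        except socket.gaierror as exc:
            # The failure mode behind "The remote name could not be resolved".
            logging.error("DNS resolution failed for %s: %s", ENDPOINT, exc)
        time.sleep(INTERVAL_SECONDS)

Four timestamped failures over six months would then read as a trend rather than a CI/CD glitch.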
Esophagus4
I’ve seen this happen with startups as well -
They’ll get acquired and top people leave as their stock vests or get pushed out because the megacorp wants someone different in the seat.
The people who knew the tech are gone and you’re left with an unmaintainable mess that becomes unreliable and no one knows how to fix it.
citizenpaul
It's almost like institutional knowledge is a real thing that you cannot put on some BIG BRAIN MBA spreadsheet.
neilv
AWS is still my overall favorite cloud provider, and I use it very effectively.
I would've even liked to work at AWS myself, if it were clear that they're solving a few concerns:
1. Rumors of rough corporate culture, and you needing your manager to shield you from it. (If it can't be immediately solved for all of Amazon or white-collar, maybe start with increasing job-seeker confidence for AWS or per-team.)
2. Even very experienced engineer candidates must go through some silly corporate coding screen, and an interview to make sure they've memorized some ritual STAR answers about Leadership Principles. If your prospective manager can't even get you out of that, what worse corporate things can't they shield you from?
3. RTO. As well as all the claims it wasn't done consistent with the Leadership Principles, and claims that it's not about working effectively.
4. Difficult-sounding on-call rotation, for people who aren't shift workers. (Even if you Principal out of on-call, you don't want your teammates to be overextended, nor to have awkwardness because you're getting a more consistent sleep schedule that is denied them.)
Also, not a concern, but an idea that applies to all the FAANGs lately: What about actively renewing the impression in the field that this is a place where people who are really good go? Meta's historical approach seems to be to pay better, and to release prominent open source code, and be involved in open hardware. Google (besides having a reputation for technical/competence excellence and warmer values) historically had a big frat-pledging mystique going on, though it turned into a ritual transaction, and everyone optimized for that ritual. AWS has a lot of technical/competence excellence to be proud of, and could make sure that they're investing in various facets of that, including attracting and retaining the best workers, and helping them be most effective, and then making sure the field knows that.
aussieguy1234
The PIP culture of AWS sounds horrifying. As a decent engineer, I would not work there unless that is addressed.
I've heard it's as bad as this: take a team of 5 genius engineers, the best 5 in the world.
There is a PIP quota, so one of them must be PIP'd, despite being among the top 5 engineers globally.
bithead
What - their AI couldn't find it sooner? Better get those RAGs in order.
jadenPete
This article seems sensationalized and lacking in evidence. Layoffs alone (especially when so much of the industry did them) don't seem sufficient to explain today's outage, especially when we know so little of the technical details behind it. It's disappointing that The Register didn't wait until we had a postmortem from AWS before jumping to conclusions.
g-b-r
For those who worked there recently, how much does the comment at [1] reflect the current state of things?
[1] https://forums.theregister.com/forum/all/2025/10/20/aws_outa...
> It's no secret that AWS has been seeing a mass talent exodus. Probably 90% of the folks I know who were the best and brightest at what they do and worked at AWS are no longer there. It's not a blip but a full-blown exodus, and the talent bar has been severely lowered.
> The writing has been on the wall for a while that something like yesterday would happen.
1vuio0pswjnm7
""It's always DNS" is a long-standing sysadmin saw, and with good reason: a disproportionate number of outages are at their heart DNS issues. And so today, as AWS is still repairing its downed cloud as this article goes to press, it becomes clear that the culprit is once again DNS. "
I use stored DNS data.^1 The data is collected periodically and stored permanently.
I seem to be unaffected by DNS-based outages.
I use stored data because it is faster, e.g., faster than using shared DNS caches like Google DNS, Cloudflare DNS, etc., but there are obviously other benefits.
1. When I make HTTP requests there is no corresponding remote DNS query. The IP address is stored in the memory of the localhost-bound forward proxy.
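A minimal sketch of the idea in Python, assuming a simple snapshot file; the localhost-bound proxy described above would keep the same data in memory instead:

    # Hypothetical stored-DNS client: resolve once, persist the answer,
    # then connect by IP so no resolver is consulted at request time.
    import json
    import socket
    import ssl

    SNAPSHOT = "dns_snapshot.json"  # assumed storage location

    def snapshot(host: str) -> None:
        """Resolve now and store the answer for later, DNS-free use."""
        ip = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4][0]
        with open(SNAPSHOT, "w") as f:
            json.dump({host: ip}, f)

    def fetch_without_dns(host: str, path: str = "/") -> bytes:
        """Connect straight to the stored IP; no DNS query happens here."""
        with open(SNAPSHOT) as f:
            ip = json.load(f)[host]
        ctx = ssl.create_default_context()
        raw = socket.create_connection((ip, 443), timeout=10)
        # server_hostname keeps SNI and certificate validation tied to the
        # hostname even though the TCP connection was made by IP.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(f"GET {path} HTTP/1.1\r\nHost: {host}\r\n"
                        f"Connection: close\r\n\r\n".encode())
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

The obvious trade-off is staleness: if the stored IP stops answering, you fail differently, not less.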
jqpabc123
Nothing gets sold or fixed without people who know how it's built.
ChrisMarshallNY
> When that tribal knowledge departs, you're left having to reinvent an awful lot of in-house expertise that didn't want to participate in your RTO games, or play Layoff Roulette yet again this cycle.
…
> This is a tipping point moment. Increasingly, it seems that the talent who understood the deep failure modes is gone. The new, leaner, presumably less expensive teams lack the institutional knowledge needed to, if not prevent these outages in the first place, significantly reduce the time to detection and recovery. Remember, there was a time when Amazon's "Frugality" leadership principle meant doing more with less, not doing everything with basically nothing. AWS's operational strength was built on redundant, experienced people, and when you cut to the bone, basic things start breaking.
Not just Amazon. I woke up this morning to find my iCloud inbox stuffed with unread spam, much of it over a month old. It looks like someone restored an old backup, likely to correct issues caused by the AWS outage, either directly or indirectly.
It’s nice to know that Apple (or some other middleman) backs up the torrents of spam that I get.
Everything is now at Jurassic-scale. It’s all monstrously big. There’s no such thing as a “small problem,” anymore.
One thing you get with experience is “tribal knowledge,” and that stuff is usually impossible to properly document. I suspect that AI may, in the future, be able to incorporate some of it, but that's by no means certain.
oaiey
I am reading the book Children of Time, where descendants of mankind try to keep their tech running long after the ship set out.
We are now coming into an age in which standing application and infrastructure systems have to run long after their original creators have left the ship.
In my opinion, as an industry we are not mature enough for that, and we need to become better.
jnaina
When organizations begin to prioritize personal-brand builders and performative hires over the core technologists and long-tenured institutional experts who actually understand how things work, the culture inevitably shifts.
When that imbalance grows, as it has at AWS (ex-AWS here), and the volume of relentless self-promoting “LinkedIn personalities” and box-ticking DEI appointments starts to outnumber the true builders and stewards of institutional memory, the execution quality, accountability, and technical excellence begin to erode.
It is becoming increasingly clear that Andy Jassy’s leadership is no longer effective, and it is only a matter of time before Wall Street begins calling for his departure.
ortusdux
"If you were a ‘product person’ at IBM or Xerox: so you make a better copier or better computer. So what? When you have a monopoly market-share, the company’s not any more successful. So the people who make the company more successful are the sales and marketing people, and they end up running the companies. And the ‘product people’ get run out of the decision-making forums.
The companies forget how to make great products. The product sensibility and product genius that brought them to this monopolistic position gets rotted out by people running these companies who have no conception of a good product vs. a bad product. They have no conception of the craftsmanship that’s required to take a good idea and turn it into a good product. And they really have no feeling in their hearts about wanting to help the customers.”
- Steve Jobs - https://en.wikipedia.org/wiki/Steve_Jobs:_The_Lost_Interview
One of the best-written articles I've read in a long time. I wish general news coverage had this tight blend of fact, context, and long-term perspective.
pjjpo
I think it's important Amazon remains stable and a quicker resolution would have been great.
That being said, if many important services (the article mentions banking) still have a single point of failure in us-east-1, the least stable but cheapest region, there is a problem far greater than Amazon here.
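For illustration, client-side regional failover doesn't have to be exotic. A minimal sketch, assuming something like DynamoDB Global Tables already replicates the data, with table and region names made up:

    # Hypothetical failover read: try the primary region, fall back to a
    # replica region when the endpoint is unreachable (e.g. DNS failures).
    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    REGIONS = ["us-east-1", "us-west-2"]  # primary first, then fallback

    def get_item_with_failover(table: str, key: dict) -> dict | None:
        last_error = None
        for region in REGIONS:
            try:
                client = boto3.client("dynamodb", region_name=region)
                resp = client.get_item(TableName=table, Key=key)
                return resp.get("Item")
            except (EndpointConnectionError, ClientError) as exc:
                last_error = exc  # keep trying the next region
        raise last_error

    # item = get_item_with_failover("orders", {"id": {"S": "123"}})

The catch, of course, is that replication and failover have to be set up before the outage, not during it.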
1dom
We all watched this happen across FAANG, right? In the early/mid 2010s working at Amazon meant you were cream of the crop.
By 2020, no engineer in their right mind wanted to work there because it was an infamously bad employer for people who wanted to create great tech in a nerdy-fun environment.
The AI space is showing how the "darling fun tech company" to "aggressive tech employer full of psychopaths" trope can take less than a few years now!
liampulles
This AWS outage has reminded me of, and bolstered my confidence in, the idea that there really are practical limits on how well we can manage complexity.
As a codebase ages and services grow in scale and scope, complexity increases. Developers know this. I don't believe you can linearly scale your support to accommodate the resulting unknown unknowns; I'm not even sure you can exponentially scale it. There is a minimum expected resolution time, set by your complexity, that you cannot go under.
I think outages like this that get resolved quickly are the exceptional cases; this is the norm we should expect.
chicagobuss
Internal reports from current AWS engineers seem to be confirming all of the speculation in this article. Shit's rotten from the inside out, and you can pretty evenly blame AI, brain drain, and good old-fashioned "big company politics".
https://forums.theregister.com/forum/all/2025/10/20/aws_outa...
This fails to recognize that the people who designed everything to rely on us-east-1 did so a long time ago. "Brain drain" could just mean that they've had their fun and now want other people to deal with their mess.
> I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time.
That information is under NDA, so it's only natural you aren't privy to it.
Trisell
I’m pretty sure at this point I know more about AWS and AWS internals than my account solution architect and I’ve never worked for AWS.
vehementi
Mere hours into the incident, before there's any RCA, someone rushes to discredit themselves with a simplistic explanation
homeonthemtn
There's an argument to be made that each event generates new institutional knowledge for those that are there.
Bit of a double-edged sword.
mynameisjoseph
I've been experiencing a similar problem after a 30% layoff at my company.
A cut that size doesn't distinguish experts from everyone else; it inevitably includes very experienced engineers, which makes problem solving very slow and lets issues pile up.
lgregg
I think they might still have deeper issues from their outage. I just got an email and a retroactive charge for something I returned months ago, which shows as returned on their own orders portal. The link in their transactional email also points to a totally different product.
neilv
> This is The Register, a respected journalistic outlet.
Yes, but they bristle at the thought. :)
fnordpiglet
With cbell gone and ajassy promoted, plus the misery-inducing litany of self-inflicted morale wounds, it's not surprising reliability is regressing. There's no head of engineering like Charlie anymore; Garman is a strong engineering leader, but coming after a sales guy and taking over a battered workforce, it's not clear to me he can turn things around easily. Everyone I know worth their salt left AWS already, and the ones left: meh. That's how attrition through misery works.
mmonaghan
Ehh, I trust the reporting and generally agree that RTO was/is ham-fistedly executed, but I dunno if this particular incident "makes" the narrative. IIRC the LSE (large-scale event) rate has been increasing for many years, maybe most of AWS's existence. This is part and parcel of building something so complex that continues to grow and evolve.
I do expect much better of them, and they certainly have problems to solve, but this is a big-company evolution thing and not an Amazon-specific thing imo.
WesolyKubeczek
We don't care. We don't have to. We are the cloud company.
causal
Yeah. They will identify the cause, but not the cause behind the cause.
Dig1t
> Amazon remained the single largest H-1B sponsor, increasing approvals from 9,257 in 2024 to 10,044 in 2025, an addition of 787 visas.
https://www.reddit.com/r/SeattleWA/comments/1ncm25p/amazon_m...
I'm confused how they can have such a failure; they are employing the best and brightest top-tier talent from India.
Hopefully they can increase their H-1B allotment even more next year to help prevent these types of failures.
Dunno about brain drain, but some departments seem to have a mandate of "must have AI" when procuring products. Pump and keep pumping!
More discussion: https://news.ycombinator.com/item?id=45640838
Brain drain implies they went somewhere better. Where did they go?
znpy
Amazon has officially entered its Day 2 era.
donavanm
Terrible article. I'm ex-AWS; I left as a principal after 10 years to go take another global megacorp's shilling. I don't even disagree with the premise, but it's so clearly a predetermined conclusion, written as an opinion piece to fit the hot news topic.
E.g., the sloppy-as-hell and inconsistent premise:
> engineers had identified DNS resolution of the DynamoDB API endpoint for US-EAST-1 as the likely root cause
That's the point: it wasn't the root cause. The root cause was ipso facto much more complex, insidious, and unobservably spooky action at a distance. I say that not knowing the true cause, but being very willing to bet a bunch of AMZN that it wasn't as simple as "herp derp DNS is hard and college-hire SDEs don't understand octets and delegation."
Or this stupid citation, if we're talking about senior/long-term AWS tech roles:
> Internal documents reportedly say that Amazon suffers from 69 percent to 81 percent regretted attrition across all employment levels.
The citation _appears_ to be about consumer/retail delivery and ops folks, and says that 69-80% _of total attrition is regretted attrition_. El Reg has written it up to imply 80% _annual attrition_ in a completely different org and business unit.
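The difference matters more than it might look. With made-up numbers:

    # Illustrative arithmetic only; the 12% baseline is invented.
    headcount = 10_000
    annual_attrition_rate = 0.12         # assumed share of staff leaving per year
    regretted_share_of_attrition = 0.75  # the kind of figure the documents describe

    leavers = headcount * annual_attrition_rate         # 1,200 people
    regretted = leavers * regretted_share_of_attrition  # 900 people

    print(regretted / headcount)  # 0.09 -- 9% of headcount, not 75%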
So I know Corey isn't stupid, and hot takes are his paycheck. But does he think his readers are stupid?
ferguess_k
I have to quote one of the comments:
> "Hopefully today will serve as a massive wake-up call for AWS"
I wouldn't hold your breath. There will be incident reviews, meetings, assessments, analyses, etc., but it will all boil down to "what can we do to stop this from happening again without actually spending any more money." So no: no hiring fresh talent or retaining the talent already in play, no radical overhaul of process and knowledge, no remediation of known issues if it involves expenditure. Instead it will be do more with less: beat the employees harder, demand more and more diligence and output from fewer and fewer people for the same or less money, and spin it like mad with catchy titles like knowledge sharing, centers of excellence, efficiency improvement initiatives, agile resilience, and continuous operational excellence.
There’ll be shiny PowerPoint decks about empowering ownership and shifting left, while the remaining engineers are shifting caffeine straight into their bloodstream at 3 a.m.
Next quarter, they’ll unveil a bold new policy called Focus Fridays, which will promptly be filled with mandatory incident retrospectives. Someone will suggest replacing ancient tooling, only to be told, “We’ll revisit that next fiscal year,” which is code for never.
Then come the internal awards: “Unsung Hero of the Outage” goes to the one poor sod who rebooted the wrong thing but accidentally fixed it.
HR will roll out a “Resilience Recognition” badge on the intranet. This will be marketed with great fanfare and excitement, showcasing how the company truly values its employees and recognizes their contributions, because badges are cheap. Leadership will congratulate themselves for “learning from adversity,” and by the time the next blackout happens, they’ll have a snazzy new dashboard to watch it fail in real time, alongside the investment portfolio dashboard that takes up a greater fraction of their attention.
But don’t worry!!!! There’ll be a T-shirt. “I survived the 2025 AWS outage.” Comes in gray. Just like morale. If it weren't for the negative impacts on the employees and customers, the word Schadenfreude would be very applicable.
And it's a sad indictment of current management practices, and in particular the MBA brigade*, that this is all by design: acceptable losses on the altar of profit, albeit short-term profit. Efficiency theatre as far as the eye can see.
*Yes, the same people who think Jack Welch was a misunderstood visionary rather than the spiritual father of mass layoffs, short-termism, and shareholder-value human sacrifices. The kind who see burnout as a KPI and chaos as a “scaling opportunity.”
Next they’ll launch a “Transformation Task Force” whose primary transformation will be renaming the same broken process from post-mortem to value realization review. A new acronym, a new logo, and boom, problem solved at a low, low cost (honest, the consultants said so). Until the next outage, at which point someone will quote Sun Tzu in Slack.
throw-10-13
AWS is a globally centralized point of failure; it should not be allowed to exist.
FrustratedMonky
I'm out of touch. Is Amazon going through some turmoil? Why are people leaving?
I mean in software. I know warehouses are pretty bad.
NKosmatos
This is how articles should be written; this is why I've been reading El Reg (a.k.a. The Register) all these decades; this is what happens when upper management cares only about profits and real engineers don't eat the RTO bullshit. Bravo for putting this online.
P.S. I’m not an Amazon hater, replace the company name with any other big one of your choice and the article will have the same meaning ;-)
benjaminclauss
big "who is John Galt" vibes in these comments lol
whateveracct
glad a company that did RTO got fucked
hope it only gets worse for them
add-sub-mul-div
Amazon has reportedly been a shitty place to work forever, so using issues that happen to be popular today to explain turnover is disingenuous.
the_real_cher
AWS has been having issues like this for years.
sporkland
Garbage reporting:
1. AWS had an outage
2. AWS has lost a lot of employees
Conclusion:
The brain drain led to the outage...
I need an LLM trained explicitly to spot folks confusing correlation with causation and put a big old red dot in my address bar.
I love that there's a whole section, "The talent drain evidence", trying to defend their journalistic integrity, but then they go on to totally face-plant.
https://geohot.github.io/blog/jekyll/update/2025/09/13/get-o...
I'm ex-Amazon. The company promotes, hires, and fires based on everything _except_ merit. I saw many projects fail due to underqualified teammates and leadership. Amazon is an incredible company, but it was only a matter of time till its activism caught up with it.
behnamoh
> It is a fact that there have been 27,000+ Amazonians impacted by layoffs between 2022 and 2024, continuing into 2025. It's hard to know how many of these were AWS versus other parts of its Amazon parent, because the company is notoriously tight-lipped about staffing issues.
> Internal documents reportedly say that Amazon suffers from 69 percent to 81 percent regretted attrition across all employment levels. In other words, "people quitting who we wish didn't."
> The internet is full of anecdata of senior Amazonians lamenting the hamfisted approach of their Return to Office initiative; experts have weighed in citing similar concerns.
So the title is all speculation. The author put 2 and 2 together and concluded that 10 is greater than 9.
Worthless article.
jongjong
Speaking of DNS, I still cannot comprehend why we rely on the current complex, aging, centralized, rent-seeking DNS.
It's one of the few parts of the internet that could potentially be replaced over time with very little disruption.
The hierarchy of resolvers could be replaced with a far simpler, flat namespace on a blockchain, where people could buy and permanently own their domains directly on-chain... No recurring fees. People could host websites on the blockchain from beyond the grave... This is kind of a dream of mine. Not possible in our current system.
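As a toy sketch of what that flat, first-come-first-served registry would look like (everything here is hypothetical, and it ignores consensus, payments, transfers, and key management entirely):

    # Toy on-chain name registry: one global name -> record map,
    # owned forever by the first registrant. Not a real chain.
    from dataclasses import dataclass

    @dataclass
    class NameRecord:
        owner_pubkey: str  # whoever holds this key controls the record
        target: str        # e.g. an IP address or a content hash

    registry: dict[str, NameRecord] = {}

    def register(name: str, owner_pubkey: str, target: str) -> bool:
        """Permanent and first-come-first-served: no renewals, no rent."""
        if name in registry:
            return False  # already owned, forever
        registry[name] = NameRecord(owner_pubkey, target)
        return True

    def resolve(name: str) -> str | None:
        rec = registry.get(name)
        return rec.target if rec else None

    register("example", "pubkey:alice", "203.0.113.7")
    print(resolve("example"))  # 203.0.113.7

The hard parts, squatting, lost keys, and who pays for the replicated storage, are exactly what the current system's recurring fees and hierarchy paper over.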