Between the engineering staff and the warehouse workers, I wonder how long it will be until they've fired everyone who would ever have been willing to work there.
Even with candidate pools of hundreds of thousands of H-1B engineers and tens of millions of illegal-immigrant warehouse workers, there still comes a point where a company this big, firing this many people this quickly, exhausts all its options.
It reminds me of the Robot Chicken sketch where Imperial officers aboard the Death Star all pretend to be force-choked to death by Darth Vader so they can avoid getting killed by lightsaber, then come back under different names in different jobs. It's worse for Amazon, though: nobody wants to come back.
https://www.youtube.com/watch?v=fFihTRIxCkg
Look, not to defend anything Amazon is doing, but this causal chain seems rather pareidolic and under-evidenced. You could spin some kind of crazy narrative about any major outage based on policy changes that happened just before. But this isn't nearly the first AWS outage, and most of them happened before the recent RTO changes. At best, the claim needs more evidence.
mlhpdx
This is the time to accept that the path forward is keeping people and giving them the best tools you possibly can to do their work. That is, what has been true for decades remains true.
Yes, development tools are better every day. Yes, you can downsize. No, it won't be felt immediately. Yes, it mortgages the future, and at a painfully high interest rate.
Suspending disbelief won’t make downsizing work better.
crmd
I wish to understand the virtue of Amazon culture.
It seems that at L6 and below, workers are Taylorism-style fungible widgets driven to convert salary into work product, guided to produce the most output for the longest time before mentally breaking down and being swiftly replaced, while L7 and above is so incredibly political that keeping the snakes and vultures from eating your team is a full-time job at every level of senior management.
It never made sense to me how such a ruthless and inhumane culture is sustainable in the long run.
I would love to hear positive counter perspectives from Amazonians because the anecdotes from my L6-L10 friends describe what sounds like an inhumane hell on earth.
shelled
It was Diwali vacation in India. It looks like the managers were not able to force everyone to walk around with their laptops and pagers hanging from their necks and waists, respectively, which they normally do.
If there's one thing I have learned from my Amazon mates, it's that they never have true time off. Hills, beaches, a marriage in the family: no exceptions. It's so pervasive that I can't really imagine it being voluntary, and my friends' answers on this topic have never been clear.
nijave
It was certainly suspicious that actual progress on the outage seemed to start right around U.S. west coast start of day. Updates before that were largely generic "we're monitoring and mitigating" with nothing of substance.
pinkmuffinere
> one really gets the sense that it took them 75 minutes to go from "things are breaking" to "we've narrowed it down to a single service endpoint, but are still researching," which is something of a bitter pill to swallow
Is 75 minutes really considered that long of a time? I don't do my day-job in webdev, so maybe I'm just naive. But being able to diagnose the single service endpoint in 75 minutes seems pretty good to me. When I worked on firmware we frequently spent _weeks_ trying to diagnose what part of the firmware was broken.
whatever1
Tech will learn like manufacturing folks did that experience is not fungible. You can try to replace someone, but the new guy also needs to accumulate the scars from the system for years before taking over.
You cannot just keep abstracting and chopping systems to smaller and smaller subsystems to make them easy to digest.
At some point someone needs to know how these coordinate and behave under disturbances. At some point someone needs to know at a low level what the hell is going on.
pdonis
One thing I love about El Reg is that they never shrink from calling a spade a spade.
tcgv
According to the article, the issue was caused by:
> "engineers had identified DNS resolution of the DynamoDB API endpoint for US-EAST-1 as the likely root cause"
Interestingly, we found matching errors in our own logs:
> System.Net.WebException
> The remote name could not be resolved: 'dynamodb.us-east-1.amazonaws.com'
Occurrences were recorded on:
- 2025-04-30
- 2025-05-29
- 2025-06-17
- Yesterday
We had logged this as a low-priority bug since previous incidents only affected our AWS testing environments (and never our production env which is on Azure). At the time, we assumed it was some CI/CD glitch.
It now seems that the underlying cause was this DNS issue all along, and only yesterday did it start impacting systems outside of AWS.
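For what it's worth, a failure pattern like this is cheap to surface early. A minimal sketch of a DNS-resolution canary (the endpoint is from our logs above; the interval, and the idea of running it this way at all, are just illustrative):

    # Hypothetical DNS-resolution canary: log every failure with a timestamp
    # instead of letting intermittent errors hide in low-priority bug reports.
    import logging
    import socket
    import time

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")

    ENDPOINT = "dynamodb.us-east-1.amazonaws.com"
    INTERVAL_SECONDS = 60  # assumed polling interval

    while True:
        try:
            # Exercises the same resolver path an HTTP client would use.
            infos = socket.getaddrinfo(ENDPOINT, 443, proto=socket.IPPROTO_TCP)
            logging.info("resolved %s -> %s", ENDPOINT,
                         sorted({info[4][0] for info in infos}))
        except socket.gaierror as exc:
            # The failure mode behind "The remote name could not be resolved".
            logging.error("DNS resolution failed for %s: %s", ENDPOINT, exc)
        time.sleep(INTERVAL_SECONDS)

Four timestamped failures over six months would then read as a trend rather than a CI/CD glitch.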
Esophagus4
I’ve seen this happen with startups as well -
They’ll get acquired and top people leave as their stock vests or get pushed out because the megacorp wants someone different in the seat.
The people who knew the tech are gone and you’re left with an unmaintainable mess that becomes unreliable and no one knows how to fix it.
citizenpaul
It's almost like institutional knowledge is a real thing that you cannot put on some BIG BRAIN MBA spreadsheet.
neilv
AWS is still my overall favorite cloud provider, and I use it very effectively.
I would've even liked to work at AWS myself, if it were clear that they're solving a few concerns:
1. Rumors of rough corporate culture, and you needing your manager to shield you from it. (If it can't be immediately solved for all of Amazon or white-collar, maybe start with increasing job-seeker confidence for AWS or per-team.)
2. Even very experienced engineer candidates must go through some silly corporate coding screen, and an interview to make sure they've memorized some ritual STAR answers about Leadership Principles. If your prospective manager can't even get you out of that, what worse corporate things can't they shield you from?
3. RTO. As well as all the claims it wasn't done consistent with the Leadership Principles, and claims that it's not about working effectively.
4. Difficult-sounding on-call rotation, for people who aren't shift workers. (Even if you Principal out of on-call, you don't want your teammates to be overextended, nor to have awkwardness because you're getting a more consistent sleep schedule that is denied them.)
Also, not a concern, but an idea that applies to all the FAANGs lately: What about actively renewing the impression in the field that this is a place where people who are really good go? Meta's historical approach seems to be to pay better, and to release prominent open source code, and be involved in open hardware. Google (besides having a reputation for technical/competence excellence and warmer values) historically had a big frat-pledging mystique going on, though it turned into a ritual transaction, and everyone optimized for that ritual. AWS has a lot of technical/competence excellence to be proud of, and could make sure that they're investing in various facets of that, including attracting and retaining the best workers, and helping them be most effective, and then making sure the field knows that.
aussieguy1234
The PIP culture of AWS sounds horrifying. As a decent engineer, I would not work there unless that is addressed.
I've heard it's as bad as this: take a team of 5 genius engineers, the best 5 in the world.
There is a PIP quota, so one of them must be PIP'd, despite being among the top 5 engineers globally.
bithead
What - their AI couldn't find it sooner? Better get those RAGs in order.
jadenPete
This article seems sensationalized and lacking in evidence. Layoffs alone (especially when so much of the industry did them) don't seem sufficient to explain today's outage, especially when we know so little of the technical details behind it. It's disappointing that The Register didn't wait until we had a postmortem from AWS before jumping to conclusions.
g-b-r
For those who worked there recently, how much does the comment at [1] reflect the current state of things?
[1] https://forums.theregister.com/forum/all/2025/10/20/aws_outa...
> It's no secret that AWS has been seeing a mass talent exodus. Probably 90% of the folks I know who were the best and brightest at what they do and worked at AWS are no longer there. It's not a blip but a full-blown exodus, and the talent bar has been severely lowered.
> The writing has been on the wall for a while that something like yesterday would happen.
1vuio0pswjnm7
""It's always DNS" is a long-standing sysadmin saw, and with good reason: a disproportionate number of outages are at their heart DNS issues. And so today, as AWS is still repairing its downed cloud as this article goes to press, it becomes clear that the culprit is once again DNS. "
I use stored DNS data.^1 The data is collected periodically and stored permanently.
I seem to be unaffected by DNS-based outages.
I use stored data because it is faster, e.g., faster than using shared DNS caches like Google DNS, Cloudflare DNS, etc., but there are obviously other benefits.
1. When I make HTTP requests there is no corresponding remote DNS query. The IP address is stored in the memory of the localhost-bound forward proxy.
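A minimal sketch of the idea in Python, assuming a simple snapshot file; the localhost-bound proxy described above would keep the same data in memory instead:

    # Hypothetical stored-DNS client: resolve once, persist the answer,
    # then connect by IP so no resolver is consulted at request time.
    import json
    import socket
    import ssl

    SNAPSHOT = "dns_snapshot.json"  # assumed storage location

    def snapshot(host: str) -> None:
        """Resolve now and store the answer for later, DNS-free use."""
        ip = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)[0][4][0]
        with open(SNAPSHOT, "w") as f:
            json.dump({host: ip}, f)

    def fetch_without_dns(host: str, path: str = "/") -> bytes:
        """Connect straight to the stored IP; no DNS query happens here."""
        with open(SNAPSHOT) as f:
            ip = json.load(f)[host]
        ctx = ssl.create_default_context()
        raw = socket.create_connection((ip, 443), timeout=10)
        # server_hostname keeps SNI and certificate validation tied to the
        # hostname even though the TCP connection was made by IP.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            tls.sendall(f"GET {path} HTTP/1.1\r\nHost: {host}\r\n"
                        f"Connection: close\r\n\r\n".encode())
            chunks = []
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
        return b"".join(chunks)

The obvious trade-off is staleness: if the stored IP stops answering, you fail differently, not less.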
jqpabc123
Nothing gets sold or fixed without people who know how it's built.
ChrisMarshallNY
> When that tribal knowledge departs, you're left having to reinvent an awful lot of in-house expertise that didn't want to participate in your RTO games, or play Layoff Roulette yet again this cycle.
…
> This is a tipping point moment. Increasingly, it seems that the talent who understood the deep failure modes is gone. The new, leaner, presumably less expensive teams lack the institutional knowledge needed to, if not prevent these outages in the first place, significantly reduce the time to detection and recovery. Remember, there was a time when Amazon's "Frugality" leadership principle meant doing more with less, not doing everything with basically nothing. AWS's operational strength was built on redundant, experienced people, and when you cut to the bone, basic things start breaking.
Not just Amazon. I woke up this morning to find my iCloud inbox stuffed with unread spam, much of it over a month old. It looks like someone restored an old backup, likely to correct issues caused by the AWS outage, either directly or indirectly.
It’s nice to know that Apple (or some other middleman) backs up the torrents of spam that I get.
Everything is now at Jurassic-scale. It’s all monstrously big. There’s no such thing as a “small problem,” anymore.
One thing you get with experience is “tribal knowledge,” and that stuff is usually impossible to properly document. I suspect that AI may, in the future, be able to incorporate some of it, but that's by no means certain.
oaiey
I am reading the book Children of Time, where descendants of mankind try to keep their tech running long after the ship set out.
We are now coming into an age in which standing application and infrastructure systems have to run long after their original creators have left the ship.
In my opinion, as an industry we are not mature enough for that, and we need to become better.
jnaina
When organizations begin to prioritize personal-brand builders and performative hires over the core technologists and long-tenured institutional experts who actually understand how things work, the culture inevitably shifts.
When that imbalance grows, as it has at AWS (ex-AWS here), and the volume of relentless self-promoting “LinkedIn personalities” and box-ticking DEI appointments starts to outnumber the true builders and stewards of institutional memory, the execution quality, accountability, and technical excellence begin to erode.
It is becoming increasingly clear that Andy Jassy’s leadership is no longer effective, and it is only a matter of time before Wall Street begins calling for his departure.
ortusdux
"If you were a ‘product person’ at IBM or Xerox: so you make a better copier or better computer. So what? When you have a monopoly market-share, the company’s not any more successful. So the people who make the company more successful are the sales and marketing people, and they end up running the companies. And the ‘product people’ get run out of the decision-making forums.
The companies forget how to make great products. The product sensibility and product genius that brought them to this monopolistic position gets rotted out by people running these companies who have no conception of a good product vs. a bad product. They have no conception of the craftsmanship that’s required to take a good idea and turn it into a good product. And they really have no feeling in their hearts about wanting to help the customers.”
- Steve Jobs - https://en.wikipedia.org/wiki/Steve_Jobs:_The_Lost_Interview
One of the best-written articles I've read in a long time. I wish general news coverage had this tight blend of fact, context, and long-term perspective.
pjjpo
I think it's important Amazon remains stable and a quicker resolution would have been great.
That being said, if many important services (the article mentions banking) still have a single point of failure in us-east-1, the least stable but cheapest region, there is a problem far greater than Amazon here.
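For illustration, client-side regional failover doesn't have to be exotic. A minimal sketch, assuming something like DynamoDB Global Tables already replicates the data, with table and region names made up:

    # Hypothetical failover read: try the primary region, fall back to a
    # replica region when the endpoint is unreachable (e.g. DNS failures).
    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    REGIONS = ["us-east-1", "us-west-2"]  # primary first, then fallback

    def get_item_with_failover(table: str, key: dict) -> dict | None:
        last_error = None
        for region in REGIONS:
            try:
                client = boto3.client("dynamodb", region_name=region)
                resp = client.get_item(TableName=table, Key=key)
                return resp.get("Item")
            except (EndpointConnectionError, ClientError) as exc:
                last_error = exc  # keep trying the next region
        raise last_error

    # item = get_item_with_failover("orders", {"id": {"S": "123"}})

The catch, of course, is that replication and failover have to be set up before the outage, not during it.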
1dom
We all watched this happen across FAANG, right? In the early/mid 2010s working at Amazon meant you were cream of the crop.
By 2020, no engineer in their right mind wanted to work there because it was an infamously bad employer for people who wanted to create great tech in a nerdy-fun environment.
The AI space is showing how the "darling fun tech company" to "aggressive tech employer full of psychopaths" trope can take less than a few years now!
liampulles
This AWS outage has reminded me of, and bolstered my confidence in, the idea that there really are practical limits on how well we can manage complexity.
As a codebase ages and services grow in scale and scope, complexity increases. Developers know this. I don't believe you can linearly scale your support to accommodate the resulting unknown unknowns; I'm not even sure you can exponentially scale it. There is a minimum expected resolution time, set by your complexity, that you cannot go under.
I think outages like this that get resolved quickly are the exceptional cases; this is the norm we should expect.
chicagobuss
Internal reports from current AWS engineers seem to be confirming all of the speculation in this article. Shit's rotten from the inside out, and you can pretty evenly blame AI, brain drain, and good old-fashioned "big company politics".
https://forums.theregister.com/forum/all/2025/10/20/aws_outa...
This fails to recognize that the people who designed everything to rely on us-east-1 did so a long time ago. "Brain drain" could just mean that they've had their fun and now want other people to deal with their mess.
> I've seen zero signs that this stems from a lack of transparency, and every indication that they legitimately did not know what was breaking for a patently absurd length of time.
That information is under NDA, so it's only natural you aren't privy to it.
Trisell
I’m pretty sure at this point I know more about AWS and AWS internals than my account solution architect and I’ve never worked for AWS.
vehementi
Mere hours into the incident, before there's any RCA, someone rushes to discredit themselves with a simplistic explanation
homeonthemtn
There's an argument to be made that each event generates new institutional knowledge for those that are there.
Bit of a double-edged sword.
mynameisjoseph
I've been experiencing a similar problem after a 30% layoff at my company.
A cut that size doesn't distinguish experts from everyone else; it inevitably includes very experienced engineers, which makes problem solving very slow and lets issues pile up.
lgregg
I think they might still have deeper issues from their outage. I just got an email and a retroactive charge for something I returned months ago, which shows as returned on their own orders portal. The link in their transactional email also points to a totally different product.
neilv
> This is The Register, a respected journalistic outlet.
Yes, but they bristle at the thought. :)
fnordpiglet
With cbell gone and ajassy promoted, plus the misery-inducing litany of self-inflicted morale wounds, it's not surprising reliability is regressing. There's no head of engineering like Charlie anymore; Garman is a strong engineering leader, but coming after a sales guy and taking over a battered workforce, it's not clear to me he can turn things around easily. Everyone I know worth their salt left AWS already, and the ones left: meh. That's how attrition through misery works.
mmonaghan
Ehh, I trust the reporting and generally agree that RTO was/is ham-fistedly executed, but I dunno if this particular incident "makes" the narrative. IIRC the LSE (large-scale event) rate has been increasing for many years, maybe most of AWS's existence. This is part and parcel of building something so complex that continues to grow and evolve.
I do expect much better of them, and they certainly have problems to solve, but this is a big-company evolution thing and not an Amazon-specific thing imo.
WesolyKubeczek
We don't care. We don't have to. We are the cloud company.
causal
Yeah. They will identify the cause, but not the cause behind the cause.
Dig1t
> Amazon remained the single largest H-1B sponsor, increasing approvals from 9,257 in 2024 to 10,044 in 2025, an addition of 787 visas.
https://www.reddit.com/r/SeattleWA/comments/1ncm25p/amazon_m...
I'm confused how they can have such a failure; they are employing the best and brightest top-tier talent from India.
Hopefully they can increase their H-1B allotment even more next year to help prevent these types of failures.
Dunno about brain drain, but some departments seem to have a mandate of "must have AI" when procuring products. Pump and keep pumping!
More discussion: https://news.ycombinator.com/item?id=45640838
Brain drain implies they went somewhere better. Where did they go?
znpy
Amazon has officially entered its Day 2 era.
donavanm
Terrible article. I'm ex-AWS; I left as a principal after 10 years to go take another global megacorp's shilling. I don't even disagree with the premise, but it's so clearly a predetermined conclusion, written as an opinion piece to fit the hot news topic.
E.g., the sloppy-as-hell and inconsistent premise:
> engineers had identified DNS resolution of the DynamoDB API endpoint for US-EAST-1 as the likely root cause
That's the point: it wasn't the root cause. The root cause was ipso facto much more complex, insidious, and unobservably spooky action at a distance. I say that not knowing the true cause, but being very willing to bet a bunch of AMZN that it wasn't as simple as "herp derp DNS is hard and college-hire SDEs don't understand octets and delegation."
Or this stupid citation, if we're talking about senior/long-term AWS tech roles:
> Internal documents reportedly say that Amazon suffers from 69 percent to 81 percent regretted attrition across all employment levels.
The citation _appears_ to be about consumer/retail delivery and ops folks, and says that 69-80% _of total attrition is regretted attrition_. El Reg has written it up to imply 80% _annual attrition_ in a completely different org and business unit.
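The difference matters more than it might look. With made-up numbers:

    # Illustrative arithmetic only; the 12% baseline is invented.
    headcount = 10_000
    annual_attrition_rate = 0.12         # assumed share of staff leaving per year
    regretted_share_of_attrition = 0.75  # the kind of figure the documents describe

    leavers = headcount * annual_attrition_rate         # 1,200 people
    regretted = leavers * regretted_share_of_attrition  # 900 people

    print(regretted / headcount)  # 0.09 -- 9% of headcount, not 75%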
So I know Corey isn't stupid, and hot takes are his paycheck. But does he think his readers are stupid?
ferguess_k
I have to quote one of the comments:
> "Hopefully today will serve as a massive wake-up call for AWS"
I wouldn't hold your breath. There will be incident reviews, meetings, assessments, analyses, etc., but it will all boil down to "what can we do to stop this from happening again without actually spending any more money." So no: no hiring fresh talent or retaining the talent already in play, no radical overhaul of process and knowledge, no remediation of known issues if it involves expenditure. Instead it will be do more with less: beat the employees harder, demand more and more diligence and output from fewer and fewer people for the same or less money, and spin it like mad with catchy titles like knowledge sharing, centers of excellence, efficiency improvement initiatives, agile resilience, and continuous operational excellence.
There’ll be shiny PowerPoint decks about empowering ownership and shifting left, while the remaining engineers are shifting caffeine straight into their bloodstream at 3 a.m.
Next quarter, they’ll unveil a bold new policy called Focus Fridays, which will promptly be filled with mandatory incident retrospectives. Someone will suggest replacing ancient tooling, only to be told, “We’ll revisit that next fiscal year,” which is code for never.
Then come the internal awards: “Unsung Hero of the Outage” goes to the one poor sod who rebooted the wrong thing but accidentally fixed it.
HR will roll out a “Resilience Recognition” badge on the intranet. This will be marketed with great fanfare and excitement, showcasing how the company truly values its employees and recognizes their contributions, because badges are cheap. Leadership will congratulate themselves for “learning from adversity,” and by the time the next blackout happens, they’ll have a snazzy new dashboard to watch it fail in real time, alongside the investment portfolio dashboard that takes up a greater fraction of their attention.
But don’t worry!!!! There’ll be a T-shirt. “I survived the 2025 AWS outage.” Comes in gray. Just like morale. If it weren't for the negative impacts on the employees and customers, the word Schadenfreude would be very applicable.
And it's a sad indictment of current management practices, and in particular the MBA brigade*, that this is all by design: acceptable losses on the altar of profit, albeit short-term profit. Efficiency theatre as far as the eye can see.
*Yes, the same people who think Jack Welch was a misunderstood visionary rather than the spiritual father of mass layoffs, short-termism, and shareholder-value human sacrifices. The kind who see burnout as a KPI and chaos as a “scaling opportunity.”
Next they’ll launch a “Transformation Task Force” whose primary transformation will be renaming the same broken process from post-mortem to value realization review. A new acronym, a new logo, and boom, problem solved at a low, low cost (honest, the consultants said so). Until the next outage, at which point someone will quote Sun Tzu in Slack.
throw-10-13
AWS is a globally centralized point of failure; it should not be allowed to exist.
FrustratedMonky
I'm out of touch. Is Amazon going through some turmoil? Why are people leaving?
I mean in software. I know warehouses are pretty bad.
NKosmatos
This is how articles should be written; this is why I've been reading El Reg (a.k.a. The Register) all these decades; this is what happens when upper management cares only about profits and real engineers don't eat the RTO bullshit. Bravo for putting this online.
P.S. I’m not an Amazon hater, replace the company name with any other big one of your choice and the article will have the same meaning ;-)
benjaminclauss
big "who is John Galt" vibes in these comments lol
whateveracct
glad a company that did RTO got fucked
hope it only gets worse for them
add-sub-mul-div
Amazon has reportedly been a shitty place to work forever, so using issues that happen to be popular today to explain turnover is disingenuous.
the_real_cher
AWS has been having issues like this for years.
sporkland
Garbage reporting:
1. AWS had an outage
2. AWS has lost a lot of employees
Conclusion:
The brain drain led to the outage...
I need an LLM trained explicitly to spot folks confusing correlation with causation and put a big old red dot in my address bar.
I love that there's a whole section, "The talent drain evidence", trying to defend their journalistic integrity, but then they go on to totally face-plant.
https://geohot.github.io/blog/jekyll/update/2025/09/13/get-o...
I'm ex-Amazon. The company promotes, hires, and fires based on everything _except_ merit. I saw many projects fail due to underqualified teammates and leadership. Amazon is an incredible company, but it was only a matter of time till its activism caught up with it.
behnamoh
> It is a fact that there have been 27,000+ Amazonians impacted by layoffs between 2022 and 2024, continuing into 2025. It's hard to know how many of these were AWS versus other parts of its Amazon parent, because the company is notoriously tight-lipped about staffing issues.
> Internal documents reportedly say that Amazon suffers from 69 percent to 81 percent regretted attrition across all employment levels. In other words, "people quitting who we wish didn't."
> The internet is full of anecdata of senior Amazonians lamenting the hamfisted approach of their Return to Office initiative; experts have weighed in citing similar concerns.
So the title is all speculation. The author put 2 and 2 together and concluded that 10 is greater than 9.
Worthless article.
jongjong
Speaking of DNS, I still cannot comprehend why we rely on the current complex, aging, centralized, rent-seeking DNS.
It's one of the few parts of the internet that could potentially be replaced over time with very little disruption.
The hierarchy of resolvers could be replaced with a far simpler, flat namespace on a blockchain, where people could buy and permanently own their domains directly on-chain... No recurring fees. People could host websites on the blockchain from beyond the grave... This is kind of a dream of mine. Not possible in our current system.
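As a toy sketch of what that flat, first-come-first-served registry would look like (everything here is hypothetical, and it ignores consensus, payments, transfers, and key management entirely):

    # Toy on-chain name registry: one global name -> record map,
    # owned forever by the first registrant. Not a real chain.
    from dataclasses import dataclass

    @dataclass
    class NameRecord:
        owner_pubkey: str  # whoever holds this key controls the record
        target: str        # e.g. an IP address or a content hash

    registry: dict[str, NameRecord] = {}

    def register(name: str, owner_pubkey: str, target: str) -> bool:
        """Permanent and first-come-first-served: no renewals, no rent."""
        if name in registry:
            return False  # already owned, forever
        registry[name] = NameRecord(owner_pubkey, target)
        return True

    def resolve(name: str) -> str | None:
        rec = registry.get(name)
        return rec.target if rec else None

    register("example", "pubkey:alice", "203.0.113.7")
    print(resolve("example"))  # 203.0.113.7

The hard parts, squatting, lost keys, and who pays for the replicated storage, are exactly what the current system's recurring fees and hierarchy paper over.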