Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.
The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.
Good reminder that you are only as strong as your weakest link.
0x5345414e
This is having a direct impact on my wellbeing. I was at Whole Foods in Hudson Yards NYC and I couldn’t get the prime discount on my chocolate bar because the system isn’t working. Decided not to get the chocolate bar. Now my chocolate levels are way too low.
indoordin0saur
Seems like major issues are still ongoing. If anything it seems worse than it did ~4 hours ago. For reference I'm a data engineer and it's Redshift and Airflow (AWS managed) that is FUBAR for me.
JCM9
Have a meeting today with our AWS account team about how we’re no longer going to be “All in on AWS” as we diversify workloads away. Was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services so we’re buying those from elsewhere.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
nikolay
Choosing us-east-1 as your primary region is good, because when you're down, everybody's down, too. You don't get this luxury with other US regions!
stepri
“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.”
It’s always DNS.
haunter
The Premier League said there will be only limited VAR today w/o the automatic offside system because of the AWS outage. Weird timeline we live in
https://www.bbc.com/news/live/c5y8k7k6v1rt?post=asset%3Ad902...
Cool, building in resilience seems to have worked. Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted (not sure if it would have been anyway).
My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.
The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).
Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.
Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
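For the destination-selection layer described above, here's a minimal client-side sketch of the idea (not the commenter's actual setup; the regions, table name, and key are assumptions, and it only helps if the data is already replicated, e.g. via DynamoDB global tables): prefer one region, fail fast, and fall back to the next.

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError, ConnectTimeoutError, EndpointConnectionError

# Regions tried in order; all values here are illustrative assumptions.
REGIONS = ["eu-west-1", "us-west-2", "us-east-1"]

# Fail fast so a dead region doesn't stall the caller.
FAST_FAIL = Config(connect_timeout=2, read_timeout=2,
                   retries={"max_attempts": 1, "mode": "standard"})

def get_item_with_failover(table_name, key):
    """Return the first successful GetItem response, trying each region in turn."""
    last_error = None
    for region in REGIONS:
        client = boto3.client("dynamodb", region_name=region, config=FAST_FAIL)
        try:
            return client.get_item(TableName=table_name, Key=key)
        except (ConnectTimeoutError, EndpointConnectionError, ClientError) as err:
            last_error = err  # note the failure and move on to the next region
    raise RuntimeError("all regions failed") from last_error

# e.g. get_item_with_failover("sessions", {"id": {"S": "abc123"}})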
chibea
One main problem that we observed was that big parts of their IAM / auth setup were overloaded / down, which led to all kinds of cascading problems. It sounds as if Dynamo was reported to be a root cause, so is IAM dependent on Dynamo internally?
Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potential (global) SPOF that you'd like to reduce its dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc., so you'd probably like to use the battle-proven DB infrastructure you already have in place. Does that mean you end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
sammy2255
Can't resolve any records for dynamodb.us-east-1.amazonaws.com
However, if you desperately need to access it you can force resolve it to 3.218.182.212. Seems to work for me. DNS through HN:
curl -v --resolve "dynamodb.us-east-1.amazonaws.com:443:3.218.182.212" https://dynamodb.us-east-1.amazonaws.com/
At 3:03 AM PT AWS posted that things are recovering and sounded like issue was resolved.
Then things got worse. At 9:13 AM PT it sounds like they’re back to troubleshooting.
Honestly sounds like AWS doesn’t even really know what’s going on. Not good.
melozo
Even internal Amazon tooling is impacted greatly - including the internal ticketing platform which is making collaboration impossible during the outage. Amazon is incapable of building multi-region services internally. The Amazon retail site seems available, but I’m curious if it’s even using native AWS or is still on the old internal compute platform. Makes me wonder how much juice this company has left.
emrodre
Their status page (https://health.aws.amazon.com/health/status) says the only disrupted service is DynamoDB, but it's impacting 37 other services. It is amazing to see how big a blast radius a single service can have.
Aachen
Signal is down from several vantage points and accounts in Europe, I'd guess because of this dependence on Amazon overseas
We're having fun figuring out how to communicate amongst colleagues now! It's only when it's gone that you realise your dependence.
rsanheim
I wonder what kind of outage or incident or economic change will be required to cause a rejection of the big commercial clouds as the default deployment model.
The costs, performance overhead, and complexity of a modern AWS deployment are insane and so out of line with what most companies should be taking on. But hype + microservices + sunk cost, and here we are.
fairity
As this incident unfolds, what’s the best way to estimate how many additional hours it’s likely to last? My intuition is that the expected remaining duration increases the longer the outage persists, but that would ultimately depend on the historical distribution of similar incidents. Is that kind of data available anywhere?
kuon
I realize that my basement servers have better uptime than AWS this year!
I think most sysadmins don't plan for an AWS outage. And economically it makes sense.
But it makes me wonder, is sysadmin a lost art?
0x002A
As Amazon moves from the "day-1 company" it once claimed to be toward a sales company like Oracle, focused on raking in money, expect more outages to come, and longer times to resolve them.
Amazon is burning out and driving away the technical talent and knowledge, knowing the vendor lock-in will keep bringing in the sweet money. You will see more sales people hovering around your C-suites and executives, while you face ever worse technical support that doesn't seem to know what it's talking about, let alone fix the support issue you expect to be fixed easily.
Mark my words: if you are putting your eggs in one basket, that basket is now too complex and too interdependent, and the people who built and knew its intricacies have been driven away by RTO and moves to hubs. Eventually the services that everything else (including other AWS services) depends on so heavily may be more fragile than the public knows.
kalleboo
It's fun watching their list of "Affected Services" grow literally in front of your eyes as they figure out how many things have this dependency.
It's still missing the one that earned me a phone call from a client.
ibejoeb
This is just a silly anecdote, but every time a cloud provider blips, I'm reminded. The worst architecture I've ever encountered was a system that was distributed across AWS, Azure, and GCP. Whenever any one of them had a problem, the system went down. It also cost 3x more than it should.
Waterluvian
I know there's a lot of anecdotal evidence and some fairly clear explanations for why `us-east-1` can be less reliable. But are there any empirical studies that demonstrate this? Like if I wanted to back up this assumption/claim with data, is there a good link for that, showing that us-east-1 is down a lot more often?
JPKab
The length and breadth of this outage has caused me to lose so much faith in AWS. I knew from colleagues who used to work there how understaffed and inefficient the team is due to bad management, but this just really concerns me.
Seems to have taken down my router "smart wifi" login page, and there's no backup router-only login option! Brilliant work, linksys....
rwky
To everyone that got paged (like me), grab a coffee and ride it out, the week can only get better!
tonypapousek
Looks like they’re nearly done fixing it.
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
abujazar
I find it interesting that AWS services appear to be so tightly integrated that when there's an issue in a region, it affects most or all services. Kind of defeats the purported resiliency of cloud services.
rirze
We just had a power outage in Ashburn starting at 10 pm Sunday night. Power was restored at around 3:40 am, and I know datacenters have redundant power sources, but the timing is very suspicious. The AWS outage supposedly started at midnight.
JCM9
US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions. Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.
AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”
rose-knuckle17
AWS had an outage. Many companies were impacted. Headlines around the world blame AWS. The real news is how easy it is to identify companies that have put cost management ahead of service resiliency.
Lots of orgs operating wholly in AWS, and sometimes only within us-east-1, had no operational problems last night. Some of that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).
Overall, the companies that had operational problems likely wouldn't have invested in resiliency expenses in any other deployment strategy either. It could have happened to them in Azure, GCP or even a home-rolled datacenter.
weberer
Llama-5-beelzebub has escaped containment. A special task force has been deployed to the Virginia data center to pacify it.
mittermayr
Careful: NPM _says_ they're up (https://status.npmjs.org/) but I am seeing a lot of packages not updating and npm install taking forever or never finishing. So hold off deploying now if you're dependent on that.
pjmlp
It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality, including well known industry names.
esskay
Er...They appear to have just gone down again.
runako
Even though us-east-1 is the region geographically closest to me, I always choose another region as default due to us-east-1 (seemingly) being more prone to these outages.
Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.
1970-01-01
Someone, somewhere, had to report that doorbells went down because the very big cloud did not stay up.
I think we're doing the 21st century wrong.
Aldipower
My minor 2000-user web app hosted on Hetzner works, fyi. :-P
me551ah
We created a single point of failure on the Internet, so that companies could avoid single points of failure in their data centers.
jjice
We got off pretty easy (so far). Had some networking issues at 3am-ish EDT, but nothing that we couldn't retry. Having a pretty heavily asynchronous workflow really benefits here.
One strange one was metrics capturing for Elasticache was dead for us (I assume Cloudwatch is the actual service responsible for this), so we were getting no data alerts in Datadog. Took a sec to hunt that down and realize everything was fine, we just don't have the metrics there.
I had minor protests against us-east-1 about 2.5 years ago, but it's a bit much to deal with now... Guess I should protest a bit louder next time.
jacquesm
Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
A lot of status pages hosted by Atlassian Statuspage are down! The irony…
bob1029
One thing has become quite clear to me over the years. Much of the thinking around uptime of information systems has become hyperbolic and self-serving.
There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.
I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.
The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around for a day or 2, it's almost certainly easier and cheaper than building a multi cloud complexity hellscape or dragging it all back on prem.
Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.
thomas_witt
DynamoDB is performing fine in production in eu-central-1.
Stupid question, why isn't the stock down? Couldn't this lead to people jumping to other providers and at the very least require some pretty big fees for so dramatically breaking SLAs? Is it just not a big enough fraction of revenue to matter?
AtomicOrbital
https://m.youtube.com/watch?v=KFvhpt8FN18 clear detailed explanation of the AWS outage and how properly designed systems should have shielded the issue with zero client impact
comrade1234
I like that we can advertise to our customers that over the last X years we have better uptime than Amazon, google, etc.
d_burfoot
I think AWS should use, and provide as an offering to big customers, a Chaos Monkey tool that randomly brings down specific services in specific AZs. Example: DynamoDB is down in us-east-1b. IAM is down in us-west-2a.
Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
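Until AWS offers something like that, a rough application-level sketch of the same idea: point the SDK at an unreachable endpoint in a test to simulate DynamoDB being down, and assert the app degrades gracefully. The endpoint, table name, and fallback behaviour below are made-up assumptions for illustration.

import boto3
from botocore.config import Config
from botocore.exceptions import ConnectTimeoutError, EndpointConnectionError

# TEST-NET-1 address that never answers: a stand-in for "DynamoDB is down here".
DEAD_ENDPOINT = "https://192.0.2.1"

def make_degraded_dynamodb_client():
    """DynamoDB client wired to a dead endpoint, with aggressive timeouts and no retries."""
    cfg = Config(connect_timeout=1, read_timeout=1,
                 retries={"max_attempts": 1, "mode": "standard"})
    return boto3.client(
        "dynamodb", region_name="us-east-1", endpoint_url=DEAD_ENDPOINT,
        aws_access_key_id="test", aws_secret_access_key="test",  # dummy creds so the sketch runs anywhere
        config=cfg,
    )

def fetch_feature_flags(client):
    """Hypothetical app code: fall back to safe defaults when DynamoDB is unreachable."""
    try:
        resp = client.get_item(TableName="feature-flags", Key={"id": {"S": "global"}})
        return resp.get("Item", {})
    except (ConnectTimeoutError, EndpointConnectionError):
        return {}  # safe default instead of an unhandled error

def test_survives_dynamodb_outage():
    assert fetch_feature_flags(make_degraded_dynamodb_client()) == {}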
ronakjain90
we[1] operate out of `us-east-1` but chose not to use any of the cloud-based vendor lock-in (sorry Vercel, Supabase, Firebase, PlanetScale etc). Rather, a few droplets on DigitalOcean (us-east-1) and Hetzner (EU). We serve 100 million requests/mo and a few million pieces of user-generated content (images)/mo, at just about $1000/mo.
It's not difficult, it's just that we engineers chose convenience and delegated uptime to someone else.
Our entire data stack (Databricks and Omni) is down for us as well. The nice thing is that AWS is so big and widespread that our customers are much more understanding about outages, given that it's showing up on the news.
padjo
Friends don’t let friends use us-east-1
polaris64
It looks like DNS has been restored: dynamodb.us-east-1.amazonaws.com. 5 IN A 3.218.182.189
cmiles8
US-East-1 and its consistent problems are literally the Achilles Heel of the Internet.
__alexs
Is there any data on which AWS regions are most reliable? I feel like every time I hear about an AWS outage it's in us-east-1.
artyom
Amazon has spent most of its HR post-pandemic efforts in:
• Laying off top US engineering earners.
• Aggressively mandating RTO so the senior technical personnel would be pushed to leave.
• Other political ways ("Focus", "Below Expectations") to push engineering leadership (principal engineers, etc) to leave, without it counting as a layoff of course.
• Migrating serious, complex workloads to entry-level employees in cheap office locations (India, Spain, etc).
This push was slow but mostly completed by Q1 this year. Correlation doesn't imply causation? I find that hard to believe in this case. AWS had outages before, but none like this "apparently nobody knows what to do" one.
Source: I was there.
helsinkiandrew
> The incident underscores the risks associated with the heavy reliance on a few major cloud service providers.
Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup
edtech_dev
Signal is also down for me.
goinggetthem
This is from Amazon's latest earnings call, when Andy Jassy was asked why they aren't growing as much as their competitors:
"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area."
also
"And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."
tonymet
I don't think blaming AWS is fair, since they typically exceed their regional and AZ SLAs
AWS makes their SLAs & uptime rates very clear, along with explicit warnings about building failover / business continuity.
Most of the questions on the AWS CSA exam are related to resiliency.
Look, we've all gone the lazy route and done this before. As usual, the problem exists between the keyboard and the chair.
rwke
With more and more parts of our lives depending on often only one cloud infrastructure provider as a single point of failure, enabling companies to have built-in redundancy in their systems across the world could be a great business.
Humans have built-in redundancy for a reason.
shinycode
It’s that period of the year when we discover AWS clients that don’t have fallback plans
shakesbeard
Slack (canvas and huddles), Circle CI and Bitbucket are also reporting issues due to this.
amai
The internet was once designed to survive a nuclear war. Nowadays it cannot even survive until tuesday.
greatgib
When I follow the link, I arrive on a "You broke reddit" page :-o
bigbuppo
Whose idea was it to make the whole world dependent on us-east-1?
hobo_mark
When did Snapchat move out of GCP?
bootsmann
Apparently hiring 1000s of software engineers every month was load bearing
ctbellmar
Various AI services (e.g. Perplexity) are down as well
t1234s
Do events like this stir conversations in small to medium size businesses to escape the cloud?
saejox
AWS has been the backbone of the internet. It is a single point of failure for most websites.
Other hosting services like Vercel, package managers like npm, even the Docker registries are down because of it.
renatovico
Docker Hub or GitHub's internal cache may be affected:
We are on Azure. But our CI/CD pipelines are failing, because Docker is on AWS.
mumber_typhoon
>Oct 20 12:51 AM PDT We can confirm increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. This issue may also be affecting Case Creation through the AWS Support Center or the Support API. We are actively engaged and working to both mitigate the issue and understand root cause. We will provide an update in 45 minutes, or sooner if we have additional information to share.
Weird that case creation uses the same region as the case you'd like to create for.
port3000
Even railway's status page is down (guess they use Vercel):
It won't be over until long after AWS resolves it - the outages produce hours of inconsistent data. It especially sucks for financial services, anything relying on eventual consistency, and other non-transactional processes. Some of the inconsistencies introduced today will linger and make trouble for years.
Ekaros
Wasn't the point of AWS being such a premium that you will always get at least 6 nines, if not more, in availability?
JCM9
US-East-1 is literally the Achilles Heel of the Internet.
raw_anon_1111
From the great Corey Quinn
Ah yes, the great AWS us-east-1 outage.
Half the internet’s on fire, engineers haven’t slept in 18 hours, and every self-styled “resilience thought leader” is already posting:
“This is why you need multi-cloud, powered by our patented observability synergy platform™.”
Shut up, Greg.
Your SaaS product doesn’t fix DNS, you're simply adding another dashboard to watch the world burn in higher definition.
If your first reaction to a widespread outage is “time to drive engagement,” you're working in tragedy tourism. Bet your kids are super proud.
Meanwhile, the real heroes are the SREs duct-taping Route 53 with pure caffeine and spite.
it is very funny to me that us-east-1 going down nukes the internet. all those multiple region reliability best practices are for show
altbdoor
Had a meeting where developers were discussing the infrastructure for an application. A crucial part of the whole flow was completely dependent on an AWS service. I asked if it was a single point of failure. The whole room laughed. I rest my case.
mrcsharp
Bitbucket seems affected too [1]. Not sure if this status page is regional though.
Seems to be really only in us-east-1, DynamoDB is performing fine in production on eu-central-1.
the-chitmonger
I'm not sure if this is directly related, but I've noticed my Apple Music app has stopped working (getting connection error messages). Didn't realize the data for Music was also hosted on AWS, unless this is entirely unrelated? I've restarted my phone and rebooted the app to no avail, so I'm assuming this is the culprit.
binsquare
The internal disruption reviews are going to be fun :)
menomatter
What are the design best practices and industry standards for building on-premise fallback capabilities for critical infrastructure? Say for health care/banking ..etc
renegade-otter
If we see more of this, it would not be crazy to assume that all this compelling of engineers to "use AI" and the flood of Looks-Good-To-Me code is coming home to roost.
geye1234
Potentially-ignoramus comment here, apologies in advance, but amazon.com itself appears to be fine right now. Perhaps slower to load pages, by about half a second. Are they not eating (much of) their own dog food?
jug
Of course this happens when I take a day off from work lol
Came here after the Internet felt oddly "ill" and even got issues using Medium, and sure enough https://status.medium.com
I am the CEO of the company and started it because I wanted to give engineering teams an unbreakable cloud. You can mix-n-match services of ANY cloud provider, and workloads failover seamlessly across clouds/on-prem environments.
Feel free to get in touch!
werdl
Looks like a DNS issue - dynamodb.us-east-1.amazonaws.com is failing to resolve.
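A quick way to check from your own machine whether it's resolution that's failing rather than the service itself; a small standard-library sketch:

import socket

HOST = "dynamodb.us-east-1.amazonaws.com"
try:
    # getaddrinfo exercises the same resolver path most clients use
    infos = socket.getaddrinfo(HOST, 443, proto=socket.IPPROTO_TCP)
    addrs = sorted({info[4][0] for info in infos})
    print(f"{HOST} resolves to: {', '.join(addrs)}")
except socket.gaierror as err:
    print(f"DNS resolution failed for {HOST}: {err}")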
twistedpair
Wow, about 9 hours later and 21 of 24 Atlassian services are still showing up as impacted on their status page.
Even @ 9:30am ET this morning, after this supposedly was clearing up, my doctor's office's practice management software was still hosed. Quite the long tail here.
I can't do anything for school because Canvas by Instructure is down because of this.
itqwertz
Did they try asking Claude to fix these issues? If it turns out this problem is AI-related, I'd love to see the AAR.
CTDOCodebases
I'm getting rate limit issues on Reddit so it could be related.
mannyv
This is why we use us-east-2.
wcchandler
This is usually something I see on Reddit first, within minutes. I’ve barely seen anything on my front page. While I understand it’s likely the subs I’m subscribed to, that was my only reason for using Reddit. I’ve noticed that for the past year - more and more tech heavy news events don’t bubble up as quickly anymore. I also didn’t see this post for a while for whatever reason. And Digg was hit and miss on availability for me, and I’m just now seeing it load with an item around this.
I think I might be ready to build out a replacement through vibe coding. I don’t like being dependent on user submissions though. I feel like that’s a challenge on its own.
DanHulton
I forget where I read it originally, but I strongly feel that AWS should offer a `us-chaos-1` region, where every 3-4 days, one or two services blow up. Host your staging stack there and you build real resiliency over time.
(The counter joke is, of course, "but that's `us-east-1` already!" But I mean deliberately and frequently.)
This website just seems to be an auto-generated list of "things" with a catchy title:
> 5000 Reddit users reported a certain number of problems shortly after a specific time.
> 400000 A certain number of reports were made in the UK alone in two hours.
stego-tech
Not remotely surprised. Any competent engineer knows full well the risk of deploying into us-east-1 (or any “default” region for that matter), as well as the risks of relying on global services whose management or interaction layer only exists in said zone. Unfortunately, us-east-1 is the location most outsourcing firms throw stuff, because they don’t have to support it when it goes pear-shaped (that’s the client’s problem, not theirs).
My refusal to hoard every asset into AWS (let alone put anything of import in us-east-1) has saved me repeatedly in the past. Diversity is the foundation of resiliency, after all.
mentalgear
> Amazon Alexa: routines like pre-set alarms were not functioning.
It's ridiculous how everything is being stored in the cloud, even simple timers. It's high time to move functionality back on-device, which would also come with the advantage of making it easier to disconnect from big tech's capitalist surveillance state.
EbNar
Maybe it's because of this that trying to pay with PayPal on Lenovo's website has failed three times for me today? Just asking... Knowing how everything is connected nowadays, it wouldn't surprise me at all.
can't log into https://amazon.com either after logging out; so many downstream issues
seviu
I can't log in to my AWS account, in Germany; on top of that, it is not possible to order anything or change payment options on amazon.de.
No landing page explaining services are down, just scary error pages. I thought my account was compromised. Thanks HN for, as always, being the first to clarify what's happening.
Scary to see that in order to order from Amazon Germany, us-east-1 must be up. Everything else works flawlessly but payments are a no-go.
Nowadays when this happens it's always something. "Something went wrong."
Even the error message itself is wrong whenever that one appears.
fujigawa
Appears to have also disabled that bot on HN that would be frantically posting [dupe] in all the other AWS outage threads right about now.
dabinat
My site was down for a long time after they claimed it was fixed. Eventually I realized the problem lay with Network Load Balancers so I bypassed them for now and got everything back up and running.
littlecranky67
Just a couple of days ago in this HN thread [0] there were quite a few users claiming Hetzner is not an option as its uptime isn't as good as AWS's, hence the higher AWS pricing being worth the investment. Oh, the irony.
Hey wait wasn't the internet supposed to route around...?
fastball
One of my co-workers was woken up by his Eight Sleep going haywire. He couldn't turn it off because the app wouldn't work (presumably running on AWS).
fsto
Ironically, the HTTP request to this article timed out twice before a successful response.
rickette
Couple of years ago us-east was considered the least stable region here on HN due to its age. Is that still a thing?
whatsupdog
I can not login to my AWS account. And, the "my account" on regular amazon website is blank on Firefox, but opens on Chrome.
Edit: I can login into one of the AWS accounts (I have a few different ones for different companies), but my personal which has a ".edu" email is not logging in.
lexandstuff
Yes, we're seeing issues with Dynamo, and potentially other AWS services.
Appears to have happened within the last 10-15 minutes.
aaronbrethorst
My ISP's DNS servers were inaccessible this morning. Cloudflare and Google's DNS servers have all been working fine, though: 1.1.1.1, 1.0.0.1, and 8.8.8.8
glemmaPaul
LOL: make one DB service a central point of failure, charge gold for small compute instances, rage about needing Multi-AZ, and push the costs onto the developer/organization. But now it fails at the region level, so are we going to need multi-country setups for simple small applications?
toephu2
Half the internet goes down because part of AWS goes down... what happened to companies having redundant systems and not having a single point of failure?
a-dub
i am amused at how us-east-1 is basically in the same location as where aol kept its datacenters back in the day.
rafa___
"Oct 20 2:01 AM PDT We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1..."
It's always DNS...
karel-3d
Slack was down, so I thought I will send message to my coworkers on Signal.
Signal was also down.
shawn_w
One of the radio stations I listen to is just dead air tonight. I assume this is the cause.
lsllc
The Ring (Doorbell) app isn't working, nor are any of the MBTA (Transit) status pages/apps.
thundergolfer
This is widespread. ECR, EC2, Secrets Manager, Dynamo, IAM are what I've personally seen down.
twistedpair
I just saw services that were up since 545AM ET go down around 12:30PM ET.
Seems AWS has broken Lambda again in their efforts to fix things.
kevinsundar
AWS pros know to never use us-east-1. Just don't do it. It is easily the least reliable region
ssehpriest
Airtable is down as well.
A lot of businesses have all their workflows depending on their data in Airtable.
1970-01-01
Completely detached from reality, AMZN has been up all day and closed up 1.6%. Wild.
okr
Btw, we had a forced EKS restart last Thursday due to Kubernetes updates, and something was done with DNS there. We had problems with ndots, which caused some trouble here. Wouldn't be surprised if it's related, heh.
sam1r
Chime has been completely down for almost 12 hours.
Impacting all banking services with a red status error. Oddly enough, only their direct deposits are functioning without issues.
AWS's own management console sign-in isn't even working. This is a huge one. :(
draxil
I was just about to post that it didn't affect us (heavy AWS users, in eu-west-1). Buut, I stopped myself because that was just massively tempting fate :)
moribvndvs
So, uh, over the weekend I decided to use the fact that my company needs a status checker/page to try out Elixir + Phoenix LiveView, and just now I found out my region is down while tinkering with it and watching Final Destination. That’s a little too on the nose for my comfort.
assimpleaspossi
I'm thinking about that one guy who clicked on "OK" or hit return.
valdiorn
I missed a parcel delivery because a computer server in Virginia, USA went down, and now the doorbell on my house in England doesn't work. What. The. Fork.
How the hell did Ring/Amazon not include a radio-frequency transmitter for the doorbell and chime? This is absurd.
To top it off, I'm trying to do my quarterly VAT return, and Xero is still completely borked, nearly 20 hours after the initial outage.
vivzkestrel
stupid question: is buying a server rack and running it at home subject to more downtimes in a year than this? has anyone done an actual SLA analysis?
littlecranky67
Maybe unrelated, but yesterday I went to pick up my package from an Amazon Locker in Germany, and the display said "Service unavailable". I'll wait until later today before I go and try again.
BiraIgnacio
It's scary to think about how much power and perhaps influence the AWS platform has. (albeit it shouldn't be surprising)
mslm
Happened to be updating a bunch of NPM dependencies and then saw `npm i` freeze and I'm like... ugh, what did I do? Then npm login wasn't working and I started searching here for an outage, and voilà.
codebolt
Atlassian cloud is also having issues. Closing in on the 3 hour mark.
It seems that all the sites that ask about distributed systems in their interviews and have their websites down wouldn't even pass their own interview.
This is why distributed systems is an extremely important discipline.
bstsb
glad all my services are either Hetzner servers or EU region of AWS!
pardner
Darn, on Heroku even the "maintenance mode" (redirects all routes to a static url) won't kick in.
suralind
I wonder how their nines are going. Guess they'll have to stay pretty stable for the next 100 years.
grk
Does anyone know if having Global Accelerator set up would help right now? It's in the list of affected services, I wonder if it's useful in scenarios like this one.
pageandrew
Can't even get STS tokens. RDS Proxy is down, SQS, Managed Kafka.
randomtoast
Thing is, us-east-1 is the primary region for many AWS services. DynamoDB is a very central offering used by many services. And the kind of issue that happened is very common[^1].
I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[^2].
I seem to recall other issues around this time in previous years. I wonder if this is some change getting shoe-horned in ahead of some reinvent release deadline...
roosgit
Can confirm. I was trying to send the newsletter (with SES) and it didn't work. I was thinking my local boto3 was old, but I figured I should check HN just in case.
YouAreWRONGtoo
I don't get how you can be a trillion dollar company and still suck this much.
My Alexa is hit or miss at responding to queries right now at 5:30 AM EST. Was wondering why it wasn't answering when I woke up.
alvis
Why would us-east-1 cause many UK banks and even UK gov web sites down too!?
Shouldn't they operate in the UK region due to GDPR?
danielpetrica
In moments like this I think devs should invest in vendor independence if they can. While I'm not at that stage yet (Cloudflare dependence), using open technologies like Docker (or Kubernetes) and Traefik instead of managed services can help in these disaster situations by letting you switch to a different provider much faster than rebuilding from zero.
As a disclosure, I'm still not at that point with my own infrastructure, but I'm slowly trying to define one for myself.
motbus3
Always a lovely Monday when you wake just in time to see everything going down
klon
Statuspage.io seems to load (but is slow), but what is the point if you can't post an incident because the Atlassian ID service is down?
cranberryturkey
Sling still down at 11:42PM PST
arrty88
I expect gcp and azure to gain some customers after this
hexbin010
Why after all these years is us-east-1 such a SPOF?
Danborg
r/aws not found
There aren't any communities on Reddit with that name. Double-check the community name or start a new community.
yuvadam
During the last us-east-1 apocalypse 14 years ago, I started awsdowntime.com - don't make me register it again and revive the page.
jpfromlondon
This will always be a risk when sharecropping.
montek01singh
I cannot create a support ticket with AWS as well.
hipratham
Strangely, some of our services are scaling up in us-east-1, and there's a downtick on downdetector.com, so the issue might be resolving.
donmb
Asana down
Postman workspaces don't load
Slack affected
And the worst: heroku scheduler just refused to trigger our jobs
nivekney
Wait a second, Snapchat impacted AGAIN? It was impacted during the last GCP outage.
world2vec
Slack and Zoom working intermittently for me
kedihacker
Only us-east-1 gets new services immediately; others might, but it's not a guarantee. Which regions are a good alternative?
jcmeyrignac
Impossible to connect to JIRA here (France).
antihero
My website on the cupboard laptop is fine.
sinpor1
Its influence is so great that it caused half of the internet to stop working properly.
As of 4:26am Central Time in the USA, it's back up for one of my services.
pmig
Thank god we built all our infra on top of EKS, so everything works smoothly =)
president_zippy
I wonder how much better the uptime would be if they made a sincere effort to retain engineering staff.
Right now on levels.fyi, the highest-paying non-managerial engineering role is offered by Oracle. They might not pay the recent grads as well as Google or Microsoft, but they definitely value the principal engineers w/ 20 years of experience.
sph
10:30 on a Monday morning and already slacking off. Life is good. Time to touch grass, everybody!
bpye
Amazon.ca is degraded, some product pages load but can't see prices. Amusing.
TrackerFF
Lots of outage in Norway, started approximately 1 hour ago for me.
ZeWaka
Alexa devices are also down.
nla
I still don't know why anyone would use AWS hosting.
gritzko
idiocracy_window_view.jpg
ryanmcdonough
Now, I may well be naive - but isn't the point of these systems that you fail over gracefully to another data centre and no-one notices?
bicepjai
Is this the outage that took Medium down ?
chistev
What is HN hosted on?
8cvor6j844qw_d6
That's unusual.
I was under the impression that having multiple availability zones guarantees high availability.
It seems this is not the case.
croemer
Coinbase down as well
killingtime74
Signal is down for me
mmmlinux
Ohno, not Fortnite! oh, the humanity.
testemailfordg2
Seems like we need more anti-trust cases against AWS, or to break it up; it is becoming too big. Services used in the rest of the world get impacted by issues in one region.
htrp
thundering herd problems.... every time they say they fix it something else breaks
Great. Hope they’re down for a few more days and we can get some time off.
codebolt
Atlassian cloud is having problems as well.
hubertzhang
I cannot pull images from docker hub.
jimrandomh
The RDS proxy for our postgres DB went down.
mk89
It's fun to see SREs jumping left and right when they can do basically nothing at all.
"Do we enable DR? Yes/No". That's all you can do. If you do, a whole machinery starts up, which might take longer than the outage itself.
They can't even use Slack to communicate - messages are being dropped/not sent.
And then we laugh at the South Koreans for not having backed up their hard drives (which got burnt in an actual fire, a statistically far rarer event than an AWS outage). OK, that's a huge screw-up, but hey, this is not insignificant either.
What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.
al_james
Can't even log in via the AWS access portal.
seanieb
Clearly this is all some sort of mass delusion event, the Amazon Ring status says everything is working.
(Useless service status pages are incredibly annoying)
goodegg
Terraform Cloud is having problems too
andrewinardeer
Signal not working here for me in AU
motiejus
Too big to recover.
tosh
SES and signal seem to work again
mpcoder
I can't even see my EKS clusters
mcphage
It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single point of failures.
Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
skywhopper
There are plenty of ways to address this risk. But the companies impacted would have to be willing to invest in the extra operational cost and complexity. They aren’t.
starkindustries
Zoom is unable to send screenshots.
circadian
BGP (again)?
bitpatch
Can confirm, also getting hit with this.
homeonthemtn
"We should have a fail back to US-West."
"It's been on the dev teams list for a while"
"Welp....."
IOT_Apprentice
Apparently IMDb, an Amazon service, is impacted. LOL, no multi-region failover.
j45
More and more I want to be cloud agnostic or multi-cloud.
AtNightWeCode
Considering the history of east-1 it is fascinating that it still causes so many single point of failure incidents for large enterprises.
dude250711
They are amazing at LeetCode though.
nodesocket
Affecting Coinbase[1] as well, which is ridiculous. Can't access the web UI at all. At their scale and importance they should be multi-region if not multi-cloud.
you put your sh*t in us-east-1 you need to plan for this :)
redwood
Surprising and sad to see how many folks are using DynamoDB
There are more full featured multi-cloud options that don't lock you in and that don't have the single point of failure problems.
And they give you a much better developer experience...
Sigh
gramakri2
npm registry also down
ktosobcy
Uhm... E(U)ropean sovereignty (and in general spreading the hosting as much as possible) needed ASAP…
solatic
And yet, AMZN is up for the day. The market doesn't care. Crazy.
Aldipower
altavista.com is also down!
askonomm
Docker is also down.
binsquare
Don't miss this
goodegg
Happy Monday People
amelius
Medium also.
lawlessone
Am i imagining it or are more things like this happening in recent weeks than usual?
rdm_blackhole
My app deployed on Vercel, and therefore indirectly on us-east-1, was down for about 2 hours today, then came back up, then went down again 10 minutes ago for 2 or 3 minutes. It seems like there are still intermittent issues happening.
ArcHound
Good luck to all on-callers today.
It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with the intel, even though it's not a feature that brings money in (at least from a business perspective).
thinkindie
Today’s reminder: multi-region is so hard even AWS can’t get it right.
busymom0
For me Reddit is down and also the amazon home page isn't showing any items for me.
neuroelectron
Sounds like a circular error: monitoring is flooding their network with metrics and logs, causing DNS to fail and produce more errors, which floods the network further. The likely root cause is something like DNS conflicts or hosts being recreated on the network. Generally this is a small amount of network traffic, but the LBs are dealing with host-address flux, so hosts keep colliding on addresses as they attempt to resolve to new ones that get lost to dropped packets, and with so many hosts in one AZ there's a good chance they end up with a conflicting address.
megous
I didn't even notice anything was wrong today. :) Looks like we're well disconnected from the US internet infra quasi-hegemony.
webdoodle
I in-housed an EMR for a local clinic because of latency and other network issues taking the system offline several times a month (usually at least once a week). We had zero downtime the whole first year after bringing it all in house, and I got employee of the month for several months in a row.
kitd
O ffs. I can't even access the NYT puzzles in the meantime ... Seriously disrupted, man
ta1243
Paying for resilience is expensive. not as expensive as AWS, but it's not free.
Modern companies live life on the edge. Just in time, no resilience, no flexibility. We see the disaster this causes whenever something unexpected happens - the Ever Given blocking the Suez Canal, for example, let alone something like Covid.
However increasingly what should be minor loss of resilience, like an AWS outage or a Crowdstrike incident, turns into major failures.
This fragility is something government needs to legislate to prevent. When one supermarket is out that's fine - people can go elsewhere, the damage is contained. When all fail, that's a major problem.
On top of that, the attitude that the entire sector has is also bad. People think it's fine if IT fails once or twice a year. If that attitude reaches truly important systems it will lead to major civil problems. Any civilisation is three good meals away from anarchy.
There's no profit motive to avoid this, companies don't care about being offline for the day, as long as all their mates are also offline.
grenran
seems like services are slowly recovering
tosh
seeing issues with SES in us-east-1 as well
pantulis
Now I know why the documents I was sending to my Kindle didn't go through.
moralestapia
Curious to know how much does an outage like this cost to others.
Lost data, revenue, etc.
I'm not talking about AWS but whoever's downstream.
Is it like 100M, like 1B?
add-sub-mul-div
Keep going
Ygg2
Ironically enough I can't access Reddit due to no healthy upstream.
JCharante
Ring is affected. Why doesn’t Ring have failover to another region?
spwa4
Reddit seems to be having issues too:
"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"
Well that takes down Docker Hub as well it looks like.
t0lo
It's weird that we're living in a time where this could be a taste of a prolonged future global internet blackout by adversarial nations. Get used to this feeling I guess :)
t0lo
Can't log into tidal for my music
chaidhat
is this why docker is down?
empressplay
Can't check out on Amazon.com.au, gives error page
"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."
Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.
dangoodmanUT
Reminder that AZs don't go down
Entire regions go down
Don't pay for intra-az traffic friends
aiiizzz
Slack was acting slower than usual, but did not go down. Color me impressed.
jumploops
"Never choose us-east-1"
jdlyga
Time to start calling BS on the 9's of reliability
zwnow
I love this to be honest. Validates my anti cloud stance.
throw-10-13
this is why you avoid us-east-1
zoklet-enjoyer
Is this why Wordle logged me out and my 2 guesses don't seem to have been recorded? I am worried about losing my streak.
There are entire apps like Reddit that are still not working. What the fuck is going on?
ta1243
Meanwhile my pair of 12 year old Raspberry Pis handling my home services like DNS survive their 3rd AWS us-east-1 outage.
"But you can't do webscale uptime on your own"
Sure. I suspect even a single pi with auto-updates on has less downtime.
Readerium
99.999 percent lol
kkfx
Honestly, anyone can have outages; that's nothing extraordinary. What's wrong is the number of impacted services. We chose (or at least mostly chose) to ditch mainframes for clusters partly for resilience. Now, with cheap desktop iron labeled "stable enough to be a serious server", we have seen mainframes re-created, sometimes with a cluster of VMs on top of a single server, sometimes with cloud services.
Ladies and gentlemen, it's about time to learn reshoring in the IT world as well. Owning nothing and renting everything means extreme fragility.
Xenoamorphous
Slack now failing for me.
worik
This outage is a reminder:
Economic efficiency and technical complexity are both, separately and together, enemies of resilience
LightBug1
Remember when the "internet will just route around a network problem"?
FFS ...
SergeAx
How much longer are we going to tolerate this marketing bullshit about "Designed to provide 99.999999999% durability and 99.99% availability"?
DataDaemon
But but this is a cloud, it should exist in the cloud.
martinheidegger
Designed to provide 99.999% durability and 99.999% availability
Still designed, not implemented
nemo44x
Someone’s got a case of the Monday’s.
xodice
Major us-east-1 outages happened in 2011, 2015, 2017, 2020, 2021, 2023, and now again. I understand that us-east-1, N. VA, was the first DC, but for fuck's sake, they've had HOW LONG to finish AWS and make us-east-1 not be tied to keeping AWS up.
grebc
Good thing hyperscalers provide 100% uptime.
wartywhoa23
Someone vibecoded it down.
robertpohl
Looks like we're back!
BartjeD
So much for the peeps claiming amazing Cloud uptime ;)
avi_vallarapu
This is the reason why it is important to plan disaster recovery and also plan multi-cloud architectures.
Our applications and databases must have ultra-high availability. It can be achieved with applications and data platforms hosted in different regions for failover.
Critical businesses should also plan for replication across multiple cloud platforms.
You may use some of the existing solutions out there that can help with such implementations for data platforms.
- Qlik replicate
- HexaRocket
and some more.
Or rather implement native replication solutions available with data platforms.
Interesting day. I've been on an incident bridge since 3AM. Our systems have mostly recovered now with a few back office stragglers fighting for compute.
The biggest miss on our side is that, although we designed a multi-region capable application, we could not run the failover process because our security org migrated us to Identity Center and only put it in us-east-1, hard locking the entire company out of the AWS control plane. By the time we'd gotten the root credentials out of the vault, things were coming back up.
Good reminder that you are only as strong as your weakest link.
This is having a direct impact on my wellbeing. I was at Whole Foods in Hudson Yards NYC and I couldn’t get the prime discount on my chocolate bar because the system isn’t working. Decided not to get the chocolate bar. Now my chocolate levels are way too low.
Seems like major issues are still ongoing. If anything it seems worse than it did ~4 hours ago. For reference I'm a data engineer and it's Redshift and Airflow (AWS managed) that is FUBAR for me.
Have a meeting today with our AWS account team about how we’re no longer going to be “All in on AWS” as we diversify workloads away. Was mostly about the pace of innovation on core services slowing and AWS being too far behind on AI services so we’re buying those from elsewhere.
The AWS team keeps touting the rock solid reliability of AWS as a reason why we shouldn’t diversify our cloud. Should be a fun meeting!
Choosing us-east-1 as your primary region is good, because when you're down, everybody's down, too. You don't get this luxury with other US regions!
“Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1. We are working on multiple parallel paths to accelerate recovery.”
It’s always DNS.
The Premier League said there will be only limited VAR today w/o the automatic offside system becasue of the AWS outage. Weird timeline we live in
https://www.bbc.com/news/live/c5y8k7k6v1rt?post=asset%3Ad902...
Cool, building in resilience seems to have worked. Our static site has origins in multiple regions via CloudFront and didn’t seem to be impacted (not sure if it would have been anyway).
My control plane is native multi-region, so while it depends on many impacted services it stayed available. Each region runs in isolation. There is data replication at play but failing to replicate to us-east-1 had no impact on other regions.
The service itself is also native multi-region and has multiple layers where failover happens (DNS, routing, destination selection).
Nothing’s perfect and there are many ways this setup could fail. It’s just cool that it worked this time - great to see.
Nothing I’ve done is rocket science or expensive, but it does require doing things differently. Happy to answer questions about it.
One main problem that we observed was that big parts of their IAM / auth setup was overloaded / down which led to all kinds of cascading problems. It sounds as if Dynamo was reported to be a root cause, so is IAM dependent on dynamo internally?
Of course, such a large control plane system has all kinds of complex dependency chains. Auth/IAM seems like such a potentially (global) SPOF that you'd like to reduce dependencies to an absolute minimum. On the other hand, it's also the place that needs really good scalability, consistency, etc. so you probably like to use the battle proof DB infrastructure you already have in place. Does that mean you will end up with a complex cyclic dependency that needs complex bootstrapping when it goes down? Or how is that handled?
Can't resolve any records for dynamodb.us-east-1.amazonaws.com
However, if you desperately need to access it you can force resolve it to 3.218.182.212. Seems to work for me. DNS through HN
curl -v --resolve "dynamodb.us-east-1.amazonaws.com:443:3.218.182.212" https://dynamodb.us-east-1.amazonaws.com/
At 3:03 AM PT AWS posted that things are recovering and sounded like issue was resolved.
Then things got worse. At 9:13 AM PT it sounds like they’re back to troubleshooting.
Honestly sounds like AWS doesn’t even really know what’s going on. Not good.
Even internal Amazon tooling is impacted greatly - including the internal ticketing platform which is making collaboration impossible during the outage. Amazon is incapable of building multi-region services internally. The Amazon retail site seems available, but I’m curious if it’s even using native AWS or is still on the old internal compute platform. Makes me wonder how much juice this company has left.
Their status page (https://health.aws.amazon.com/health/status) says the only disrupted service is DynamoDB, but it's impacting 37 other services. It is amazing to see how big a blast radius a single service can have.
Signal is down from several vantage points and accounts in Europe, I'd guess because of this dependence on Amazon overseas
We're having fun figuring out how to communicate amongst colleagues now! It's when it's gone when you realise your dependence
I wonder what kind of outage or incident or economic change will be required to cause a rejection of the big commercial clouds as the default deployment model.
The costs, performance overhead, and complexity of a modern AWS deployment are insane and so out of line with what most companies should be taking on. But hype + microservices + sunk cost, and here we are.
As this incident unfolds, what’s the best way to estimate how many additional hours it’s likely to last? My intuition is that the expected remaining duration increases the longer the outage persists, but that would ultimately depend on the historical distribution of similar incidents. Is that kind of data available anywhere?
I realize that my basement servers have better uptime than AWS this year!
I think most sysadmins don't plan for an AWS outage. And economically it makes sense.
But it makes me wonder, is sysadmin work a lost art?
As Amazon moves from the day-1 company it once claimed to be toward a sales company like Oracle focused on raking in money, expect more outages to come, and longer ones to resolve.
Amazon is burning and driving away technical talent and knowledge, knowing the vendor lock-in will keep bringing in the sweet money. You will see more sales people hovering around your C-suites and executives, while you face even worse technical support that doesn't seem to know what it's talking about, let alone fix the support issue you expect to be fixed easily.
Mark my words: if you are putting your eggs in one basket, that basket is now too complex and too interdependent, and the people who built it and knew its intricacies have been driven away by RTOs and moves to hubs. Eventually the services everything else (including other AWS services) heavily depends on may be more fragile than the public knows.
It's fun watching their list of "Affected Services" grow literally in front of your eyes as they figure out how many things have this dependency.
It's still missing the one that earned me a phone call from a client.
This is just a silly anecdote, but every time a cloud provider blips, I'm reminded. The worst architecture I've ever encountered was a system that was distributed across AWS, Azure, and GCP. Whenever any one of them had a problem, the system went down. It also cost 3x more than it should.
I know there's a lot of anecdotal evidence and some fairly clear explanations for why `us-east-1` can be less reliable. But are there any empirical studies that demonstrate this? Like if I wanted to back up this assumption/claim with data, is there a good link for that, showing that us-east-1 is down a lot more often?
The length and breadth of this outage has caused me to lose so much faith in AWS. I knew from colleagues who used to work there how understaffed and inefficient the team is due to bad management, but this just really concerns me.
Looks like it affected Vercel, too. https://www.vercel-status.com/
My website is down :(
(EDIT: website is back up, hooray)
Seems to have taken down my router "smart wifi" login page, and there's no backup router-only login option! Brilliant work, linksys....
To everyone that got paged (like me), grab a coffee and ride it out, the week can only get better!
Looks like they’re nearly done fixing it.
> Oct 20 3:35 AM PDT
> The underlying DNS issue has been fully mitigated, and most AWS Service operations are succeeding normally now. Some requests may be throttled while we work toward full resolution. Additionally, some services are continuing to work through a backlog of events such as Cloudtrail and Lambda. While most operations are recovered, requests to launch new EC2 instances (or services that launch EC2 instances such as ECS) in the US-EAST-1 Region are still experiencing increased error rates. We continue to work toward full resolution. If you are still experiencing an issue resolving the DynamoDB service endpoints in US-EAST-1, we recommend flushing your DNS caches. We will provide an update by 4:15 AM, or sooner if we have additional information to share.
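For anyone following along at home, "flushing your DNS caches" looks roughly like this, depending on the OS (note that some runtimes, e.g. the JVM, keep their own in-process DNS cache governed by networkaddress.cache.ttl, which only clears on restart or expiry):
# systemd-resolved (most current Linux distros)
sudo resolvectl flush-caches
# macOS
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder
# Windows (elevated prompt)
ipconfig /flushdns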
I find it interesting that AWS services appear to be so tightly integrated that when there's an issue in a region, it affects most or all services. Kind of defeats the purported resiliency of cloud services.
We just had a power outage in Ashburn starting at 10 pm Sunday night. It was restored at around 3:40 am, and I know datacenters have redundant power sources, but the timing is very suspicious. The AWS outage supposedly started at midnight.
US-East-1 is more than just a normal region. It also provides the backbone for other services, including those in other regions. Thus simply being in another region doesn’t protect you from the consistent us-east-1 shenanigans.
AWS doesn’t talk about that much publicly, but if you press them they will admit in private that there are some pretty nasty single points of failure in the design of AWS that can materialize if us-east-1 has an issue. Most people would say that means AWS isn’t truly multi-region in some areas.
Not entirely clear yet if those single points of failure were at play here, but risk mitigation isn’t as simple as just “don’t use us-east-1” or “deploy in multiple regions with load balancing failover.”
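For reference, the baseline "multiple regions with DNS failover" pattern mentioned above looks roughly like the sketch below (hypothetical zone, hostnames, and IDs; aws CLI). It covers data-plane failover of your own endpoints, not the AWS services whose control planes are homed in us-east-1:
# Health check against the primary region's endpoint.
aws route53 create-health-check \
  --caller-reference use1-primary-check \
  --health-check-config '{"Type":"HTTPS","FullyQualifiedDomainName":"use1.api.example.com","ResourcePath":"/healthz","RequestInterval":30,"FailureThreshold":3}'
# PRIMARY/SECONDARY failover records: Route 53 answers with the secondary
# only while the primary's health check is failing.
aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE --change-batch '{
  "Changes": [
    {"Action": "UPSERT", "ResourceRecordSet": {"Name": "api.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "use1", "Failover": "PRIMARY", "HealthCheckId": "<id-from-create-health-check>",
      "ResourceRecords": [{"Value": "use1.api.example.com"}]}},
    {"Action": "UPSERT", "ResourceRecordSet": {"Name": "api.example.com", "Type": "CNAME", "TTL": 60,
      "SetIdentifier": "euw1", "Failover": "SECONDARY",
      "ResourceRecords": [{"Value": "euw1.api.example.com"}]}}
  ]}'
The catch, which underscores the parent's point: Route 53's DNS answers and health-check evaluation are data plane and keep working during an event like this, but change-resource-record-sets goes through a control plane homed in us-east-1, so the failover records have to exist before the outage.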
AWS had an outage. Many companies were impacted. Headlines around the world blame AWS. The real news is how easy it is to identify companies that have put cost management ahead of service resiliency.
Lots of orgs operating wholly in AWS, and sometimes only within us-east-1, had no operational problems last night. Some of that is design (not using the impacted services). Some of that is good resiliency in design. And some of that was dumb luck (accidentally good design).
Overall, the companies that had operational problems likely wouldn't have invested in resiliency in any other deployment strategy either. It could have happened to them on Azure, GCP, or even a home-rolled datacenter.
Llama-5-beelzebub has escaped containment. A special task force has been deployed to the Virginia data center to pacify it.
Careful: NPM _says_ they're up (https://status.npmjs.org/) but I am seeing a lot of packages not updating and npm install taking forever or never finishing. So hold off deploying now if you're dependent on that.
It just goes to show the difference between best practices in cloud computing, and what everyone ends up doing in reality, including well known industry names.
Er...They appear to have just gone down again.
Even though us-east-1 is the region geographically closest to me, I always choose another region as default due to us-east-1 (seemingly) being more prone to these outages.
Obviously, some services are only available in us-east-1, but many applications can gain some resiliency just by making a primary home in any other region.
Someone, somewhere, had to report that doorbells went down because the very big cloud did not stay up.
I think we're doing the 21st century wrong.
My minor 2000 users web app hosted on Hetzner works fyi. :-P
We created a single point of failure on the Internet, so that companies could avoid single points of failure in their data centers.
We got off pretty easy (so far). Had some networking issues at 3am-ish EDT, but nothing that we couldn't retry. Having a pretty heavily asynchronous workflow really benefits here.
One strange one was metrics capturing for Elasticache was dead for us (I assume Cloudwatch is the actual service responsible for this), so we were getting no data alerts in Datadog. Took a sec to hunt that down and realize everything was fine, we just don't have the metrics there.
I had minor protests against us-east-1 about 2.5 years ago, but it's a bit much to deal with now... Guess I should protest a bit louder next time.
Every week or so we interview a company and ask them if they have a fall-back plan in case AWS goes down or their cloud account disappears. They always have this deer-in-the-headlights look. 'That can't happen, right?'
Now imagine for a bit that it will never come back up. See where that leads you. The internet got its main strengths from the fact that it was completely decentralized. We've been systematically eroding that strength.
Robinhood's completely down. Even their main website: https://robinhood.com/
I'm so happy we chose Hetzner instead but unfortunately we also use Supabase (dashboard affected) and Resend (dashboard and email sending affected).
Probably makes sense to add "relies on AWS" to the criteria we're using to evaluate 3rd-party services.
Wonder if this is related
https://www.dockerstatus.com/pages/533c6539221ae15e3f000031
Oh no... maybe LaLiga found out pirates are hosting on AWS?
Is this why reddit is down? (https://www.redditstatus.com/ still says it is up but with degraded infrastructure)
It has started recovering now. https://www.whatsmydns.net/#A/dynamodb.us-east-1.amazonaws.c... is showing full recovery of dns resolutions.
Man, I just wanted to enjoy celebrating Diwali with my family, but I've been up since 3am trying to recover our services. There goes some quality time.
Internet, out.
Very big day for an engineering team indeed. Can't vibe code your way out of this issue...
AWS truly does stand for "All Web Sites".
Our Alexa's stopped responding and my girl couldn't log in to myfitness pal anymore.. Let me check HN for a major outage and here we are :^)
At least when us-east is down, everything is down.
funny that even if we have our app running fine in AWS europe, we are affected as developers because of npm/docker/etc being down. oh well.
"The root cause is an underlying internal subsystem responsible for monitoring the health of our network load balancers."
https://health.aws.amazon.com/health/status?path=service-his...
A lot of status pages hosted on Atlassian Statuspage are down! The irony…
One thing has become quite clear to me over the years. Much of the thinking around uptime of information systems has become hyperbolic and self-serving.
There are very few businesses that genuinely cannot handle an outage like this. The only examples I've personally experienced are payment processing and semiconductor manufacturing. A severe IT outage in either of these businesses is an actual crisis. Contrast with the South Korean government who seems largely unaffected by the recent loss of an entire building full of machines with no backups.
I've worked in a retail store that had a total electricity outage and saw virtually no reduction in sales numbers for the day. I have seen a bank operate with a broken core system for weeks. I have never heard of someone actually cancelling a subscription over a transient outage in YouTube, Spotify, Netflix, Steam, etc.
The takeaway I always have from these events is that you should engineer your business to be resilient to the real tradeoff that AWS offers. If you don't overreact to the occasional outage and have reasonable measures to work around for a day or 2, it's almost certainly easier and cheaper than building a multi cloud complexity hellscape or dragging it all back on prem.
Thinking in terms of competition and game theory, you'll probably win even if your competitor has a perfect failover strategy. The cost of maintaining a flawless eject button for an entire cloud is like an anvil around your neck. Every IT decision has to be filtered through this axis. When you can just slap another EC2 on the pile, you can run laps around your peers.
DynamoDB is performing fine in production in eu-central-1.
Seems to be really limited to us-east-1 (https://health.aws.amazon.com/health/status). I think they host a lot of console and backend stuff there.
Stupid question: why isn't the stock down? Couldn't this lead to people jumping to other providers, and at the very least require some pretty big fees for so dramatically breaking SLAs? Is it just not a big enough fraction of revenue to matter?
https://m.youtube.com/watch?v=KFvhpt8FN18 - a clear, detailed explanation of the AWS outage and how properly designed systems could have shielded against the issue with zero client impact.
I like that we can advertise to our customers that over the last X years we have had better uptime than Amazon, Google, etc.
I think AWS should use, and provide as an offering to big customers, a Chaos Monkey tool that randomly brings down specific services in specific AZs. Example: DynamoDB is down in us-east-1b. IAM is down in us-west-2a.
Other AWS services should be able to survive this kind of interruption by rerouting requests to other AZs. Big company clients might also want to test against these kinds of scenarios.
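AWS does sell something in this direction (Fault Injection Service), but even without it a crude game-day version is easy to run in a staging account: black-hole the dependency on one disposable test host and watch how the application degrades. A sketch of such a drill (hypothetical, staging only):
# Drop outbound HTTPS to the currently resolved DynamoDB endpoint IPs.
for ip in $(dig +short dynamodb.us-east-1.amazonaws.com | grep -E '^[0-9.]+$'); do
  sudo iptables -A OUTPUT -d "$ip" -p tcp --dport 443 -j DROP
done
# ...exercise the app and confirm that timeouts, retries, and fallbacks behave...
# Roll back (flushing the chain is acceptable on a disposable host).
sudo iptables -F OUTPUT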
we[1] operate out of `us-east-1` but chose not to use any of the cloud-based vendor lock-in (sorry Vercel, Supabase, Firebase, PlanetScale, etc). Rather, a few droplets in DigitalOcean (us-east-1) and Hetzner (EU). We serve 100 million requests/mo and a few million pieces of user-generated content (images)/mo at a cost of about $1,000/mo.
It's not difficult; it's just that we engineers chose convenience and delegated uptime to someone else.
[1] - https://usetrmnl.com
Twilio is down worldwide: https://status.twilio.com/
Looks like maybe a DNS issue? https://www.whatsmydns.net/#A/dynamodb.us-east-1.amazonaws.c...
Resolves to nothing.
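Easy to reproduce the check locally against several public resolvers, which also tells you whether it's your resolver's cache or the authoritative answer that's broken:
for ns in 1.1.1.1 8.8.8.8 9.9.9.9; do
  echo "== $ns"; dig +short dynamodb.us-east-1.amazonaws.com @"$ns"
done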
Our entire data stack (Databricks and Omni) is down for us also. The nice thing is that AWS is so big and widespread that our customers are much more understanding about outages, given that it's showing up on the news.
Friends don’t let friends use us-east-1
It looks like DNS has been restored: dynamodb.us-east-1.amazonaws.com. 5 IN A 3.218.182.189
US-East-1 and its consistent problems are literally the Achilles Heel of the Internet.
Is there any data on which AWS regions are most reliable? I feel like every time I hear about an AWS outage it's in us-east-1.
Amazon has spent most of its HR post-pandemic efforts in:
• Laying off top US engineering earners.
• Aggressively mandating RTO so the senior technical personnel would be pushed to leave.
• Other political ways ("Focus", "Below Expectations") to push engineering leadership (principal engineers, etc) to leave, without it counting as a layoff of course.
• Terminating highly skilled engineering contractors everywhere else.
• Migrating serious, complex workloads to entry-level employees in cheap office locations (India, Spain, etc).
This push was slow but mostly completed by Q1 this year. Correlation doesn't imply causation? I find that hard to believe in this case. AWS had outages before, but none like this "apparently nobody knows what to do" one.
Source: I was there.
> The incident underscores the risks associated with the heavy reliance on a few major cloud service providers.
Perhaps for the internet as a whole, but for each individual service it underscores the risk of not hosting your service in multiple zones or having a backup
Signal is also down for me.
This is from Amazon's latest earnings call, when Andy Jassy was asked why they aren't growing as much as their competitors:
"I think if you look at what matters to customers, what they care they care a lot about what the operational performance is, you know, what the availability is, what the durability is, what the latency and throughput is of of the various services. And I think we have a pretty significant advantage in that area." also "And, yeah, you could just you just look at what's happened the last couple months. You can just see kind of adventures at some of these players almost every month. And so very big difference, I think, in security."
I don't think blaming AWS is fair, since they typically exceed their regional and AZ SLAs
AWS makes their SLAs & uptime rates very clear, along with explicit warnings about building failover / business continuity.
Most of the questions on the AWS CSA exam are related to resiliency.
Look, we've all gone the lazy route and done this before. As usual, the problem exists between the keyboard and the chair.
With more and more parts of our lives depending on often only one cloud infrastructure provider as a single point of failure, enabling companies to have built-in redundancy in their systems across the world could be a great business.
Humans have built-in redundancy for a reason.
It’s that period of the year when we discover AWS clients that don’t have fallback plans
Slack (canvas and huddles), Circle CI and Bitbucket are also reporting issues due to this.
The internet was once designed to survive a nuclear war. Nowadays it cannot even survive until Tuesday.
When I follow the link, I arrive on a "You broke reddit" page :-o
Whose idea was it to make the whole world dependent on us-east-1?
When did Snapchat move out of GCP?
Apparently hiring 1000s of software engineers every month was load bearing
Various AI services (e.g. Perplexity) are down as well
Do events like this stir conversations in small to medium size businesses to escape the cloud?
AWS has been the backbone of the internet. It is a single point of failure for most websites.
Other hosting services like Vercel, package managers like npm, and even the Docker registries are down because of it.
Docker Hub, or maybe some internal GitHub cache, is affected:
Booting builder
/usr/bin/docker buildx inspect --bootstrap --builder builder-1c223ad9-e21b-41c7-a28e-69eea59c8dac
#1 [internal] booting buildkit
#1 pulling image moby/buildkit:buildx-stable-1
#1 pulling image moby/buildkit:buildx-stable-1 9.6s done
#1 ERROR: received unexpected HTTP status: 500 Internal Server Error
------
 > [internal] booting buildkit:
------
ERROR: received unexpected HTTP status: 500 Internal Server Error
We are on Azure. But our CI/CD pipelines are failing, because Docker is on AWS.
>Oct 20 12:51 AM PDT We can confirm increased error rates and latencies for multiple AWS Services in the US-EAST-1 Region. This issue may also be affecting Case Creation through the AWS Support Center or the Support API. We are actively engaged and working to both mitigate the issue and understand root cause. We will provide an update in 45 minutes, or sooner if we have additional information to share.
Weird that case creation uses the same region as the case you'd like to create for.
Even railway's status page is down (guess they use Vercel):
https://railway.instatus.com/
> due to an "operational issue" related to DNS
Always DNS..
It won't be over until long after AWS resolves it: the outage produces hours of inconsistent data. It especially sucks for financial services, anything built on eventual consistency, and other non-transactional processes. Some of the inconsistencies introduced today will linger and make trouble for years.
Wasn't the point of AWS's premium pricing that you would always get at least six nines of availability, if not more?
US-East-1 is literally the Achilles Heel of the Internet.
From the great Corey Quinn
Ah yes, the great AWS us-east-1 outage.
Half the internet’s on fire, engineers haven’t slept in 18 hours, and every self-styled “resilience thought leader” is already posting:
“This is why you need multi-cloud, powered by our patented observability synergy platform™.”
Shut up, Greg.
Your SaaS product doesn’t fix DNS, you're simply adding another dashboard to watch the world burn in higher definition.
If your first reaction to a widespread outage is “time to drive engagement,” you're working in tragedy tourism. Bet your kids are super proud.
Meanwhile, the real heroes are the SREs duct-taping Route 53 with pure caffeine and spite.
https://www.linkedin.com/posts/coquinn_aws-useast1-cloudcomp...
it is very funny to me that us-east-1 going down nukes the internet. all those multiple region reliability best practices are for show
Had a meeting where developers were discussing the infrastructure for an application. A crucial part of the whole flow was completely dependent on an AWS service. I asked if it was a single point of failure. The whole room laughed. I rest my case.
Bitbucket seems affected too [1]. Not sure if this status page is regional though.
[1] https://bitbucket.status.atlassian.com/incidents/p20f40pt1rg...
Seems to be really only in us-east-1, DynamoDB is performing fine in production on eu-central-1.
I'm not sure if this is directly related, but I've noticed my Apple Music app has stopped working (getting connection error messages). Didn't realize the data for Music was also hosted on AWS, unless this is entirely unrelated? I've restarted my phone and rebooted the app to no avail, so I'm assuming this is the culprit.
The internal disruption reviews are going to be fun :)
What are the design best practices and industry standards for building on-premise fallback capabilities for critical infrastructure? Say for healthcare/banking, etc.
If we see more of this, it would not be crazy to assume that all this compelling of engineers to "use AI", and the flood of Looks-Good-To-Me code, is coming home to roost.
Potentially-ignoramus comment here, apologies in advance, but amazon.com itself appears to be fine right now. Perhaps slower to load pages, by about half a second. Are they not eating (much of) their own dog food?
Of course this happens when I take a day off from work lol
Came here after the Internet felt oddly "ill" and even got issues using Medium, and sure enough https://status.medium.com
Anyone needing multi-cloud WITH EASE, please get in touch. https://controlplane.com
I am the CEO of the company and started it because I wanted to give engineering teams an unbreakable cloud. You can mix-n-match services of ANY cloud provider, and workloads failover seamlessly across clouds/on-prem environments.
Feel free to get in touch!
Looks like a DNS issue - dynamodb.us-east-1.amazonaws.com is failing to resolve.
Wow, about 9 hours later and 21 of 24 Atlassian services are still showing up as impacted on their status page.
Even @ 9:30am ET this morning, after this supposedly was clearing up, my doctor's office's practice management software was still hosed. Quite the long tail here.
https://status.atlassian.com/
I can't do anything for school because Canvas by Instructure is down because of this.
Did they try asking Claude to fix these issues? If it turns out this problem is AI-related, I'd love to see the AAR.
I'm getting rate limit issues on Reddit so it could be related.
This is why we use us-east-2.
This is usually something I see on Reddit first, within minutes. I’ve barely seen anything on my front page. While I understand it’s likely the subs I’m subscribed to, that was my only reason for using Reddit. I’ve noticed that for the past year - more and more tech heavy news events don’t bubble up as quickly anymore. I also didn’t see this post for a while for whatever reason. And Digg was hit and miss on availability for me, and I’m just now seeing it load with an item around this.
I think I might be ready to build out a replacement through vibe coding. I don’t like being dependent on user submissions though. I feel like that’s a challenge on its own.
I forget where I read it originally, but I strongly feel that AWS should offer a `us-chaos-1` region, where every 3-4 days, one or two services blow up. Host your staging stack there and you build real resiliency over time.
(The counter joke is, of course, "but that's `us-east-1` already!" But I mean deliberately and frequently.)
We're seeing issues with multiple AWS services https://health.aws.amazon.com/health/status
This website just seems to be an auto-generated list of "things" with a catchy title:
> 5000 Reddit users reported a certain number of problems shortly after a specific time.
> 400000 A certain number of reports were made in the UK alone in two hours.
Not remotely surprised. Any competent engineer knows full well the risk of deploying into us-east-1 (or any “default” region for that matter), as well as the risks of relying on global services whose management or interaction layer only exists in said zone. Unfortunately, us-east-1 is the location most outsourcing firms throw stuff, because they don’t have to support it when it goes pear-shaped (that’s the client’s problem, not theirs).
My refusal to hoard every asset into AWS (let alone put anything of import in us-east-1) has saved me repeatedly in the past. Diversity is the foundation of resiliency, after all.
> Amazon Alexa: routines like pre-set alarms were not functioning.
It's ridiculous how everything is being stored in the cloud, even simple timers. It's past high time to move functionality back on-device, which would come with the advantage of making it easier to de-connect from big tech's capitalist surveillance state as well.
Maybe it's because of this that trying to pay with PayPal on Lenovo's website has failed thrice for me today? Just asking... Knowing how everything is connected nowadays, it wouldn't surprise me at all.
https://status.tailscale.com/ clients' auth down :( what a day
Related thread: https://news.ycombinator.com/item?id=45640772
Related thread: https://news.ycombinator.com/item?id=45640838
can't log into https://amazon.com either after logging out; so many downstream issues
I can't log in to my AWS account, in Germany, on top of that it is not possible to order anything or change payment options from amazon.de.
No landing page explaining services are down, just scary error pages. I thought account was compromised. Thanks HN for, as always, being the first to clarify what's happening.
Scary to see that in order to order from Amazon Germany, us-east-1 must be up. Everything else works flawlessly but payments are a no-go.
Severity - Degraded...
https://health.aws.amazon.com/health/status
https://downdetector.com/
Nowadays when this happens it's always something. "Something went wrong."
Even the error message itself is wrong whenever that one appears.
Appears to have also disabled that bot on HN that would be frantically posting [dupe] in all the other AWS outage threads right about now.
My site was down for a long time after they claimed it was fixed. Eventually I realized the problem lay with Network Load Balancers so I bypassed them for now and got everything back up and running.
Just a couple of days ago, in this HN thread [0], there were quite a few users claiming Hetzner is not an option as their uptime isn't as good as AWS, hence the higher AWS pricing is worth the investment. Oh, the irony.
[0]: https://news.ycombinator.com/item?id=45614922
Amazon itself appears to be out for some products. I get a "Sorry, we couldn't find that page" when clicking on products.
We're seeing issues with RDS proxy. Wouldn't be surprised if a DNS issue was the cause, but who knows, will wait for the postmortem.
Can't login to Jira/Confluence either.
https://news.ycombinator.com/item?id=45640754
Slack is down. Is that related? Probably is.
02:34 Pacific: Things seem to be recovering.
Hey wait wasn't the internet supposed to route around...?
One of my co-workers was woken up by his Eight Sleep going haywire. He couldn't turn it off because the app wouldn't work (presumably running on AWS).
Ironically, the HTTP request to this article timed out twice before a successful response.
Couple of years ago us-east was considered the least stable region here on HN due to its age. Is that still a thing?
I cannot log in to my AWS account. And the "my account" page on the regular Amazon website is blank on Firefox, but opens in Chrome.
Edit: I can log in to one of the AWS accounts (I have a few different ones for different companies), but my personal one, which has a ".edu" email, is not logging in.
Yes, we're seeing issues with Dynamo, and potentially other AWS services.
Appears to have happened within the last 10-15 minutes.
My ISP's DNS servers were inaccessible this morning. Cloudflare and Google's DNS servers have all been working fine, though: 1.1.1.1, 1.0.0.1, and 8.8.8.8
LOL: make one DB service a central point of failure, charge gold for small compute instances, rage about needing multi-AZ, and push the costs onto the developer/organization. But now it fails at the region level, so are we going to need a multi-country setup for simple small applications?
Half the internet goes down because part of AWS goes down... what happened to companies having redundant systems and not having a single point of failure?
I am amused at how us-east-1 is basically in the same location as where AOL kept its datacenters back in the day.
"Oct 20 2:01 AM PDT We have identified a potential root cause for error rates for the DynamoDB APIs in the US-EAST-1 Region. Based on our investigation, the issue appears to be related to DNS resolution of the DynamoDB API endpoint in US-EAST-1..."
It's always DNS...
Slack was down, so I thought I will send message to my coworkers on Signal.
Signal was also down.
One of the radio stations I listen to is just dead air tonight. I assume this is the cause.
The Ring (Doorbell) app isn't working, nor are any of the MBTA (Transit) status pages/apps.
This is widespread. ECR, EC2, Secrets Manager, Dynamo, IAM are what I've personally seen down.
I just saw services that were up since 545AM ET go down around 12:30PM ET. Seems AWS has broken Lambda again in their efforts to fix things.
AWS pros know to never use us-east-1. Just don't do it. It is easily the least reliable region
Airtable is down as well.
A lot of businesses have all their workflows depending on their data in Airtable.
Completely detached from reality, AMZN has been up all day and closed up 1.6%. Wild.
Btw. we had a forced EKS restart last week on thursday due to Kubernetes updates. And something was done with DNS there. We had problems with ndots. Caused some trouble here. Would not be surprised, if it is related, heh.
Chime has completely been down for almost 12 hours.
Impacting all banking services with a red status error. Oddly enough, only their direct deposits are functioning without issues.
https://status.chime.com/
AWS's own management console sign-in isn't even working. This is a huge one. :(
I was just about to post that it didn't affect us (heavy AWS users, in eu-west-1). Buut, I stopped myself because that was just massively tempting fate :)
So, uh, over the weekend I decided to use the fact that my company needs a status checker/page to try out Elixir + Phoenix LiveView, and just now I found out my region is down while tinkering with it and watching Final Destination. That’s a little too on the nose for my comfort.
I'm thinking about that one guy who clicked on "OK" or hit return.
I missed a parcel delivery because a computer server in Virginia, USA went down, and now the doorbell on my house in England doesn't work. What. The. Fork.
How the hell did Ring/Amazon not include a radio-frequency transmitter for the doorbell and chime? This is absurd.
To top it off, I'm trying to do my quarterly VAT return, and Xero is still completely borked, nearly 20 hours after the initial outage.
stupid question: is buying a server rack and running it at home subject to more downtimes in a year than this? has anyone done an actual SLA analysis?
Maybe unrelated, but yesterday I went to pick up my package from an Amazon Locker in Germany, and the display said "Service unavailable". I'll wait until later today before I go and try again.
It's scary to think about how much power and perhaps influence the AWS platform has. (albeit it shouldn't be surprising)
Happened to be updating a bunch of NPM dependencies and then saw `npm i` freeze and I'm like... ugh, what did I do. Then npm login wasn't working and I started searching here for an outage, and voilà.
Atlassian cloud is also having issues. Closing in on the 3 hour mark.
Yeah, noticed from Zoom: https://www.zoomstatus.com/incidents/yy70hmbp61r9
It's not DNS
There's no way it's DNS
It was DNS
It seems that all the sites that ask about distributed systems in their interviews and have their websites down wouldn't even pass their own interview.
This is why distributed systems is an extremely important discipline.
glad all my services are either Hetzner servers or EU region of AWS!
Darn, on Heroku even the "maintenance mode" (redirects all routes to a static url) won't kick in.
I wonder how their nines are going. Guess they'll have to stay pretty stable for the next 100 years.
Does anyone know if having Global Accelerator set up would help right now? It's in the list of affected services, I wonder if it's useful in scenarios like this one.
Can't even get STS tokens. RDS Proxy is down, SQS, Managed Kafka.
Thing is, us-east-1 is the primary region for many AWS services. DynamoDB is a very central offering used by many services. And the kind of issue that happened is very common[^1].
I think no matter how hard you try to avoid it, in the end there's always a massive dependency chain for modern digital infrastructure[^2].
[1]: https://itsfoss.community/uploads/default/optimized/2X/a/ad3...
[2]: https://xkcd.com/2347/
Presumably the root cause of the major Vercel outage too: https://www.vercel-status.com/
It's not DNS
There's no way it's DNS
It was DNS
I seem to recall other issues around this time in previous years. I wonder if this is some change getting shoe-horned in ahead of some reinvent release deadline...
Can confirm. I was trying to send the newsletter (with SES) and it didn't work. I was thinking my local boto3 was old, but I figured I should check HN just in case.
I don't get how you can be a trillion dollar company and still suck this much.
Paddle (payment provider) is down as well: https://paddlestatus.com/
Slack, Jira and Zoom are all sluggish for me in the UK
Coinbase down as well: https://status.coinbase.com/
My Alexa is hit or miss at responding to queries right now at 5:30 AM EST. Was wondering why it wasn't answering when I woke up.
Why would us-east-1 cause many UK banks and even UK gov web sites down too!? Shouldn't they operate in the UK region due to GDPR?
In moments like this I think devs should invest in vendor independence if they can. While I'm not at that stage yet (Cloudflare dependence), using open technologies like Docker (or Kubernetes) and Traefik instead of managed services can help in these disaster situations, by letting you switch to a different provider faster than having to rebuild from zero. As a disclosure, I'm still not at that point with my own infrastructure, but I'm slowly trying to define one for myself.
Always a lovely Monday when you wake just in time to see everything going down
Statuspage.io seems to load (but is slow) but what is the point if you can't post an incident because Atlassian ID service is down.
Sling still down at 11:42PM PST
I expect gcp and azure to gain some customers after this
Why after all these years is us-east-1 such a SPOF?
r/aws not found
There aren't any communities on Reddit with that name. Double-check the community name or start a new community.
During the last us-east-1 apocalypse 14 years ago, I started awsdowntime.com - don't make me register it again and revive the page.
This will always be a risk when sharecropping.
I cannot create a support ticket with AWS as well.
Strangely some of our services are scaling up on east-1, and there is downtick on downdetector.com so issue might be resolving.
Asana down. Postman workspaces don't load. Slack affected. And the worst: Heroku Scheduler just refused to trigger our jobs.
Wait a second, Snapchat impacted AGAIN? It was impacted during the last GCP outage.
Slack and Zoom working intermittently for me
Only us-east-1 gets new services immediately; other regions might get them too, but it's not a guarantee. Which regions are a good alternative?
Impossible to connect to JIRA here (France).
My website on the cupboard laptop is fine.
His influence is so great that it caused half of the internet to stop working properly.
Perplexity is also having an outage.
https://status.perplexity.ai
Can't update my self-hosted Home Assistant because HAOS depends on Docker Hub, which seems to be still down.
Seems to be upsetting Slack a fair bit, messages taking an age to send and OIDC login doesn't want to play.
Damn. This is why Duolingo isn't working properly right now.
GCP is underrated
Finally an upside to running on Oracle Cloud!
npm and pnpm are badly affected as well. Many packages are returning 502 when fetched. Such a bad time...
That strange feeling of the world getting cleaner for a while without all these dependent services.
On a bright note, Alexa has stopped pushing me merchandise.
These things happen when profits are the measure of everything. Change your provider, but if their number doesn't go up, they won't be reliable.
So your complaints matter nothing, because "number go up".
I remember the good old days of everyone starting a hosting company. We never should have left.
It's always DNS
It’s a good day to be a DR software company or consultant
Just tried to get into Seller Central, returned a 504.
Serverless is down because servers are down. What an irony.
Lots of outage happening in Norway, too. So I'm guessing it is a global thing.
Twilio seems to be affected as well
Did someone vibe code a DNS change
They haven't listed SES there yet in the affected services on their status page
Reddit itself is breaking down and errors are appearing. Does Reddit itself depend on this?
wow I think most of Mulesoft is down, that's pretty significant in my little corner of the tech world.
https://www.youtube.com/shorts/liL2VXYNyus
As of 4:26am Central Time in the USA, it's back up for one of my services.
Thanks god we built all our infra on top of EKS, so everything works smoothly =)
I wonder how much better the uptime would be if they made a sincere effort to retain engineering staff.
Right now on levels.fyi, the highest-paying non-managerial engineering role is offered by Oracle. They might not pay the recent grads as well as Google or Microsoft, but they definitely value the principal engineers w/ 20 years of experience.
10:30 on a Monday morning and already slacking off. Life is good. Time to touch grass, everybody!
Amazon.ca is degraded, some product pages load but can't see prices. Amusing.
Lots of outage in Norway, started approximately 1 hour ago for me.
Alexa devices are also down.
I still don't know why anyone would use AWS hosting.
idiocracy_window_view.jpg
Now, I may well be naive - but isn't the point of these systems that you fail over gracefully to another data centre and no-one notices?
Is this the outage that took Medium down ?
What is HN hosted on?
That's unusual.
I was under the impression that having multiple availability zones guarantees high availability.
It seems this is not the case.
Coinbase down as well
Signal is down for me
Ohno, not Fortnite! oh, the humanity.
Seems like we need more anti-trust cases against AWS, or a need to break it up; it is becoming too big. Services used in the rest of the world get impacted by issues in one region.
Thundering herd problems... every time they say they've fixed it, something else breaks.
Both Intercom and Twilio are affected, too.
- https://status.twilio.com/
- https://www.intercomstatus.com/us-hosting
I want the web ca. 2001 back, please.
quay.io was down: https://status.redhat.com
canva.com was down until a few minutes ago.
I did get 500 error from their public ECR too
Great. Hope they’re down for a few more days and we can get some time off.
Atlassian cloud is having problems as well.
I cannot pull images from docker hub.
The RDS proxy for our postgres DB went down.
It's fun to see SREs jumping left and right when they can do basically nothing at all.
"Do we enable DR? Yes/No". That's all you can do. If you do, it's a whole machinery starting, which might take longer than the outage itself.
They can't even use Slack to communicate - messages are being dropped/not sent.
And then we laugh at the South Koreans for not having backed up their hard drives (which got burnt in an actual fire, a statistically far rarer event than an AWS outage). OK, that's a huge screw-up, but hey, this is not insignificant either.
What will happen now? Nothing, like nothing happened after Crowdstrike's bug last year.
Can't even log in via the AWS access portal.
Clearly this is all some sort of mass delusion event, the Amazon Ring status says everything is working.
https://status.ring.com/
(Useless service status pages are incredibly annoying)
Terraform Cloud is having problems too.
Signal not working here for me in AU
Too big to recover.
SES and signal seem to work again
I can't even see my EKS clusters
It shouldn’t, but it does. As a civilization, we’ve eliminated resilience wherever we could, because it’s more cost-effective. Resilience is expensive. So everything is resting on a giant pile of single point of failures.
Maybe this is the event to get everyone off of piling everything onto us-east-1 and hoping for the best, but the last few outages didn’t, so I don’t expect this one to, either.
There are plenty of ways to address this risk. But the companies impacted would have to be willing to invest in the extra operational cost and complexity. They aren’t.
Zoom is unable to send screenshots.
BGP (again)?
Can confirm, also getting hit with this.
"We should have a fail back to US-West."
"It's been on the dev teams list for a while"
"Welp....."
Apparently IMDb, an Amazon service is impacted. LOL, no multi region failover.
More and more I want to be cloud-agnostic or multi-cloud.
Considering the history of east-1 it is fascinating that it still causes so many single point of failure incidents for large enterprises.
They are amazing at LeetCode though.
Affecting Coinbase[1] as well, which is ridiculous. Can't access the web UI at all. At their scale and importance they should be multi-region if not multi-cloud.
[1] https://status.coinbase.com
you put your sh*t in us-east-1 you need to plan for this :)
Surprising and sad to see how many folks are using DynamoDB. There are more full-featured multi-cloud options that don't lock you in and that don't have these single-point-of-failure problems.
And they give you a much better developer experience...
Sigh
npm registry also down
Uhm... E(U)ropean sovereignty (and in general spreading the hosting as much as possible) needed ASAP…
And yet, AMZN is up for the day. The market doesn't care. Crazy.
altavista.com is also down!
Docker is also down.
Don't miss this
Happy Monday People
Medium also.
Am i imagining it or are more things like this happening in recent weeks than usual?
My app deployed on Vercel, and therefore indirectly deployed on us-east-1, was down for about 2 hours today, then came back up, and then went down again 10 minutes ago for 2 or 3 minutes. It seems like there are still intermittent issues happening.
Good luck to all on-callers today.
It might be an interesting exercise to map how many of our services depend on us-east-1 in one way or another. One can only hope that somebody would do something with the intel, even though it's not a feature that brings money in (at least from business perspective).
Today’s reminder: multi-region is so hard even AWS can’t get it right.
For me Reddit is down and also the amazon home page isn't showing any items for me.
Sounds like a circular error: monitoring is flooding their network with metrics and logs, causing DNS to fail and produce more errors, which floods the network further. Likely root cause is something like DNS conflicts or hosts being recreated on the network. Generally that's a small amount of network traffic, but the LBs are dealing with host-address flux: hosts keep colliding on addresses as they attempt to resolve to new ones, those resolutions get lost to dropped packets, and with so many hosts in one AZ there's a good chance they end up with yet another conflicting address.
I didn't even notice anything was wrong today. :) Looks like we're well disconnected from the US internet infra quasi-hegemony.
I in-housed an EMR for a local clinic because of latency and other network issues taking the system offline several times a month (usually at least once a week). We had zero downtime the whole first year after bringing it all in house, and I got employee of the month for several months in a row.
O ffs. I can't even access the NYT puzzles in the meantime ... Seriously disrupted, man
Paying for resilience is expensive. Not as expensive as AWS, but it's not free.
Modern companies live life on the edge. Just in time, no resilience, no flexibility. We see the disaster this causes whenever something unexpected happens: the Ever Given blocking Suez, for example, let alone something like Covid.
However increasingly what should be minor loss of resilience, like an AWS outage or a Crowdstrike incident, turns into major failures.
This fragility is something government needs to legislate to prevent. When one supermarket is out that's fine - people can go elsewhere, the damage is contained. When all fail, that's a major problem.
On top of that, the attitude the entire sector has is also bad. People think it's not a problem if IT fails once or twice a year. If that attitude affects truly important systems, it will lead to major civil problems. Any civilisation is three good meals away from anarchy.
There's no profit motive to avoid this, companies don't care about being offline for the day, as long as all their mates are also offline.
seems like services are slowly recovering
seeing issues with SES in us-east-1 as well
Now I know why the documents I was sending to my Kindle didn't go through.
Curious to know how much does an outage like this cost to others.
Lost data, revenue, etc.
I'm not talking about AWS but whoever's downstream.
Is it like 100M, like 1B?
Keep going
Ironically enough I can't access Reddit due to no healthy upstream.
Ring is affected. Why doesn’t Ring have failover to another region?
Reddit seems to be having issues too:
"upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection timeout"
Substack seems to be lying about their status: https://substack.statuspage.io/
worst outage since xmas time 2012
Snow day!
Well that takes down Docker Hub as well it looks like.
It's weird that we're living in a time where this could be a taste of a prolonged future global internet blackout by adversarial nations. Get used to this feeling I guess :)
Can't log into tidal for my music
is this why docker is down?
Can't check out on Amazon.com.au, gives error page
Vercel functions are down as well.
Slack now also down: https://slack-status.com/
"serverless"
Probably related:
https://www.nytimes.com/2025/05/25/business/amazon-ai-coders...
"Pushed to use artificial intelligence, software developers at the e-commerce giant say they must work faster and have less time to think."
Every bit of thinking time spent on a dysfunctional, lying "AI" agent could be spent on understanding the system. Even if you don't move your mouse all the time in order to please a dumb middle manager.
Reminder that AZs don't go down
Entire regions go down
Don't pay for intra-az traffic friends
Slack was acting slower than usual, but did not go down. Color me impressed.
"Never choose us-east-1"
Time to start calling BS on the 9's of reliability
I love this to be honest. Validates my anti cloud stance.
this is why you avoid us-east-1
Is this why Wordle logged me out and my 2 guesses don't seem to have been recorded? I am worried about losing my streak.
workos is down too, timing is highly correlated with AWS outage: https://status.workos.com/
That means Cursor is down, can't login.
There are entire apps like Reddit that are still not working. What the fuck is going on?
Meanwhile my pair of 12-year-old Raspberry Pis handling my home services like DNS survive their 3rd AWS us-east-1 outage.
"But you can't do webscale uptime on your own"
Sure. I suspect even a single pi with auto-updates on has less downtime.
99.999 percent lol
Honestly, anyone can have outages; that's nothing extraordinary. What's wrong is the number of impacted services. We chose (or at least mostly chose) to ditch mainframes for clusters partly for resilience. Now, with cheap desktop iron labeled "stable enough to be a serious server", we have seen the mainframe re-created, sometimes as a cluster of VMs on top of a single server, sometimes as cloud services.
Ladies and gentlemen, it's about time to learn reshoring in the IT world as well. Owning nothing and renting everything means extreme fragility.
Slack now failing for me.
This outage is a reminder:
Economic efficiency and technical complexity are both, separately and together, enemies of resilience
Remember when the "internet will just route around a network problem"?
FFS ...
How much longer are we going to tolerate this marketing bullshit about "Designed to provide 99.999999999% durability and 99.99% availability"?
But but this is a cloud, it should exist in the cloud.
Designed to provide 99.999% durability and 99.999% availability. Still designed, not implemented.
hello world
Hello world
Someone’s got a case of the Monday’s.
Major us-east-1 outages happened in 2011, 2015, 2017, 2020, 2021, 2023, and now again. I understand that us-east-1, N. VA, was the first DC, but for fuck's sake, they've had HOW LONG to finish AWS and make us-east-1 not be tied to keeping AWS up?
Good thing hyperscalers provide 100% uptime.
Someone vibecoded it down.
Looks like we're back!
So much for the peeps claiming amazing Cloud uptime ;)
This is why it is important to plan for disaster recovery and also to plan multi-cloud architectures.
Our applications and databases must have ultra-high availability. It can be achieved with applications and data platforms hosted in different regions for failover.
Critical businesses should also plan for replication across multiple cloud platforms. You may use some of the existing solutions out there that can help with such implementations for data platforms.
- Qlik Replicate
- HexaRocket
and some more.
Or rather implement native replication solutions available with data platforms.
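As one concrete example of the native-replication route for the service at the center of today's outage: with DynamoDB global tables (the 2019.11.21 version), adding a replica region is a single call, assuming the table meets the global-tables prerequisites (streams enabled, etc.); "orders" below is a hypothetical table name. The obvious caveat is that this call itself goes through the us-east-1 endpoint, so the replica has to exist before the outage, not during it.
# Add an eu-west-1 replica to an existing us-east-1 table.
aws dynamodb update-table \
  --table-name orders \
  --region us-east-1 \
  --replica-updates '[{"Create": {"RegionName": "eu-west-1"}}]'
# Check replication status.
aws dynamodb describe-table --table-name orders --region us-east-1 \
  --query 'Table.Replicas'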