aws had a rough week.
a routine dynamodb api update in us-east-1 triggered a rare automation bug. two systems tried to sync dns records at the same time and overwrote each other. that tiny race condition broke internal dns resolution for dynamodb. suddenly, a lot of aws just couldn’t find dynamodb anymore.
then came the domino effect. lambda, ec2, sqs, step functions, redshift—anything touching dynamodb—started timing out. queues piled up, services froze, cloud apps everywhere slowed or stopped.
worse, the internal network load balancer monitor got bad data from the broken dns, marked healthy servers as dead, and pulled them out of rotation. traffic shifted to fewer machines and the overload got worse.
around 2am pdt the aws team traced it to dns resolution. by 2:24 they fixed the records, stopped the cascade, and dynamodb was back. recovery for dependent services took hours.
no cyberattack, just a bad automation process. a small bug in a big system.
root cause: faulty automation and a race condition in internal dns sync for dynamodb api in us-east-1. lasted about 9 hours. broke major apps like snapchat, reddit, fortnite, and over a hundred others.
aws says they’re tightening monitoring and automation safety.
complex systems fail in weird ways. this one was a reminder.
extra - https://t.co/TfK9BPEOnZ