Amazon Web Services disruption was caused by human error
The Amazon Web Services outage that shut down a large portion of websites and apps on the East Coast on Feb. 28 was caused by an employee who made a typo in the command input, according to Amazon officials.
The root cause was in the billing system of the Simple Storage Service, or S3. The billing system was running slowly, so engineers began to debug the issue. Officials said an employee entered a command intended to take offline a small number of the servers used by the billing system.
However, the employee entered the command incorrectly, which removed a much larger set of servers at the company's Northern Virginia data centers. These servers supported two other S3 subsystems, officials said. The first was the index subsystem, which manages the metadata and location of all S3 objects in the region. The second was the placement subsystem, which manages the allocation of new storage.
Amazon's status page was also affected during the outage, which made it appear that all sites were running normally even though a portion of Amazon's sites was affected.
As a result of the disruption, Amazon is making several changes to operations, including to the tool used to remove capacity, officials said. The tool was modified to remove capacity more slowly, and safeguards were added to block the full removal of capacity and prevent future disruptions.
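The two safeguards described above, slower removal and a floor on remaining capacity, can be illustrated with a short sketch. This is not Amazon's tool. The names `Subsystem`, `plan_removal`, `min_required` and `max_per_step` are hypothetical, and the numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import List


class CapacityRemovalError(Exception):
    """Raised when a removal request would violate a safeguard."""


@dataclass
class Subsystem:
    name: str
    active_servers: int
    # Hypothetical floor: the fewest servers the subsystem needs to keep serving.
    min_required: int


def plan_removal(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> List[int]:
    """Return the batch sizes for a removal, or raise if the request is unsafe."""
    if requested <= 0:
        raise CapacityRemovalError("requested server count must be positive")

    # Safeguard 1: refuse any request that would leave too little capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityRemovalError(
            f"removing {requested} servers from {subsystem.name} would leave "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly, in small batches, not all at once.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches


if __name__ == "__main__":
    billing = Subsystem(name="billing", active_servers=10, min_required=6)
    print(plan_removal(billing, requested=3))  # small request, split into batches
    try:
        plan_removal(billing, requested=8)  # a mistyped, oversized request
    except CapacityRemovalError as err:
        print("blocked:", err)
```

In this sketch, a mistyped number that asks for too many servers is rejected before anything is taken offline, and even a valid request is carried out a few servers at a time.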
Amazon is also auditing its other operational tools to make sure they have similar safety checks. It is also breaking services into smaller partitions, called cells, to allow engineers to assess and thoroughly test recovery processes.
"During this event, the recovery time of the index subsystem still took longer than we expected," officials said. "The S3 team had planned further partitioning of the index subsystem later this year. We're reprioritizing that work to begin immediately.
"While we're proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users and their businesses. We'll do everything we can to learn from this event and use it to improve our availability even further."
Amazon was not the first service tripped up by a typo this year. The content delivery network Cloudflare had a massive data leak, known as Cloudbleed, after a typo in one of its components. Additionally, a typo in the source code of the cryptocurrency Zerocoin let a hacker steal over $500,000.