Human error caused Amazon Web Services outage, Apple iCloud service issues
Tuesday's major Amazon Web Services outage was caused by human error, the retailer has confirmed, with the downtime that affected a number of online services, including Apple's, traced back to a single incorrectly entered command issued during debugging.

The note to customers about the S3 (Simple Storage Service) disruption in the US-East-1 region advises that the team was working on an issue that caused the S3 billing system to run slower than expected. One team member executed a command from an "established playbook" to take down a small number of servers used by a subsystem in the billing process, but mistakenly took down more than required.
"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the Amazon note states.
The extra servers supported two other S3 subsystems. The first, the "index subsystem," manages metadata and location information for all S3 objects in a region and is required for the service to perform data storage and management tasks. The second, the "placement subsystem," allocates storage for new data and relies on the index subsystem in order to function.
Enough servers were taken down in both of these subsystems to cause a significant drop in capacity, forcing the team to restart both of them. During the restart, S3 was unable to service requests, which also impacted other AWS services in the region, including Amazon's Elastic Compute Cloud (EC2), Elastic Block Store (EBS) volumes, AWS Lambda, and the S3 console.
Amazon says S3's subsystems are "designed to support the removal or failure of significant capacity with little or no customer impact," built on the assumption that systems will fail and can be replaced by others. However, the index subsystem had not been fully restarted for "many years," and the massive growth of AWS meant that restarting the services and running safety checks took "longer than expected."
To prevent such a mistake from affecting services so profoundly again, the tool has been modified to remove capacity more slowly, with added safeguards that maintain the minimum required capacity level for each subsystem. Other operational tools will also be audited to ensure they have similar checks in place.
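Amazon has not published the details of these safeguards, but the idea it describes, removing capacity in small batches and refusing any request that would breach a subsystem's capacity floor, can be sketched roughly like this (all names and numbers are illustrative):

```python
"""Sketch of a capacity-removal safeguard, with hypothetical names and numbers.

The idea mirrors Amazon's description: remove capacity slowly, in small batches,
and never allow a subsystem to drop below its minimum required capacity.
"""
from dataclasses import dataclass

@dataclass
class Subsystem:
    name: str
    total_servers: int
    min_servers: int          # capacity floor this subsystem must keep

class CapacityFloorError(RuntimeError):
    """Raised when a removal would violate the minimum-capacity safeguard."""

def plan_removal(subsystem: Subsystem, requested: int, batch_size: int = 2) -> list[int]:
    """Split a removal into small batches, refusing to breach the capacity floor."""
    if subsystem.total_servers - requested < subsystem.min_servers:
        raise CapacityFloorError(
            f"Refusing to remove {requested} servers from {subsystem.name}: "
            f"only {subsystem.total_servers - subsystem.min_servers} may be removed."
        )
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(batch_size, remaining)
        batches.append(step)
        remaining -= step
    return batches

if __name__ == "__main__":
    index = Subsystem(name="index", total_servers=100, min_servers=90)
    print(plan_removal(index, requested=6))    # [2, 2, 2] -- proceeds slowly
    try:
        plan_removal(index, requested=50)      # would breach the floor
    except CapacityFloorError as err:
        print(err)
```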
Additionally, work is being carried out to repartition the index subsystem, dividing it into smaller sections to speed up recovery time.
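The new partitioning scheme has not been described, but the general reasoning, that many small partitions can be recovered independently and in parallel rather than as one long serial restart, can be illustrated with a toy timing sketch:

```python
"""Sketch of why smaller partitions shorten recovery, using made-up numbers.

Recovering one monolithic index means one long, serial restart; many small
partitions can be recovered independently (and in parallel), so the service
reaches usable capacity sooner.
"""
from concurrent.futures import ThreadPoolExecutor
import time

def recover_partition(partition_id: int, objects: int) -> int:
    """Stand-in for replaying and validating a partition's metadata."""
    time.sleep(objects / 10_000_000)   # pretend recovery time scales with size
    return partition_id

def recover(partition_sizes: list[int], workers: int) -> float:
    """Recover every partition using a pool of workers; return wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(recover_partition, range(len(partition_sizes)), partition_sizes))
    return time.perf_counter() - start

if __name__ == "__main__":
    # One big partition vs. the same data split into 16 smaller ones.
    monolithic = recover([16_000_000], workers=16)
    partitioned = recover([1_000_000] * 16, workers=16)
    print(f"monolithic: {monolithic:.2f}s, partitioned: {partitioned:.2f}s")
```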
The Service Health Dashboard, the page that shows AWS users the status of services, failed to indicate there was a problem during the downtime, as it relied on S3 in order to function and could not be updated. Amazon is reworking the dashboard to run across multiple AWS regions so that it no longer depends on any single region.
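Amazon has not said how the reworked dashboard will be built; as a rough illustration of the principle, a status check that polls endpoints in several regions rather than relying on one might look like the following, with hypothetical endpoints:

```python
"""Sketch of a status check that does not depend on any single region.

The region names are real AWS region identifiers, but the endpoints and the
idea of probing them this way are illustrative only.
"""
import socket

# Hypothetical status endpoints replicated across several regions.
STATUS_ENDPOINTS = {
    "us-east-1": "status.us-east-1.example.com",
    "us-west-2": "status.us-west-2.example.com",
    "eu-west-1": "status.eu-west-1.example.com",
}

def fetch_status(host: str, timeout: float = 2.0) -> str:
    """Very rough reachability probe; a real dashboard would fetch health data."""
    try:
        with socket.create_connection((host, 443), timeout=timeout):
            return "reachable"
    except OSError:
        return "unreachable"

def aggregate_status() -> dict[str, str]:
    """Ask every region; losing one region no longer hides the whole picture."""
    return {region: fetch_status(host) for region, host in STATUS_ENDPOINTS.items()}

if __name__ == "__main__":
    for region, state in aggregate_status().items():
        print(f"{region}: {state}")
```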
Amazon ends the note by apologizing for the impact of the event on its customers. "While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their business."
"We will do everything we can to learn from this event and use it to improve our availability even further."
The outage caused issues for a number of websites that relied on S3, as well as for apps that used Amazon's cloud servers. Apple customers were also affected, with some users of the iOS and Mac App Stores, iCloud Drive, Notes, iCloud backup, Apple TV, and Apple Music encountering issues during the downtime.
Apple is believed to be making progress in moving away from its reliance on Amazon for cloud services by building its own data centers instead. Apple's Mesa facility is being turned into a "global command center," with the company working to establish new data centers in Ireland and Denmark.
Apple's existing Reno data center, handling Siri, FaceTime, and iMessage among other tasks, may increase its size in the future. It was recently reported Apple is planning to expand the data center by over 375,000 square feet, at a cost of around $50.7 million.

Comments
Umm....a lot of things today are still done with command line prompts. You can be a lot more efficient with the command line (running scripts, etc) than with a GUI. If you've ever logged into an enterprise grade Cisco switch you don't do it with a GUI, you do it with a command line interface...sometimes using a DB-9 cable.
Yep, sounds like the failure of the GUI: it did not have an error-check system to verify that what you were doing is really what you wanted to do. I grew up on mainframe computers with nothing but command line prompts. I also learned on Unix machines and worked on VAX, and while it had a command line system, the OS had fail-safes built in: any time you typed a command that could have a really bad effect, like deleting a directory or disabling something, it would come back and ask if you really wanted to delete the directory or such. It always made you double-check what you were doing. I loved the Mac back in '84 since it would not allow you to wipe out your file system or such, and I used it before working on mainframe or Unix systems.
GUI systems are supposed to check and ask the user to verify what they are doing, especially if it would disrupt services. I worked for a networking company after leaving the computer industry, and our system, which competed with Cisco at the time, had a really nice GUI to configure and manage the network switches. It was one of its best-selling features because it would not allow you to disrupt service without serious overrides and double-checks on what you were doing.
Yes, humans make mistakes, but a command line system with the ability to bring everything down is not the way you run a multi-billion-dollar business.
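As a rough illustration of the confirmation checks described in this comment, a destructive operation can be gated on the operator retyping the target (names here are made up):

```python
"""Sketch of a confirmation check for destructive commands, with made-up names.

Destructive operations are refused unless the operator retypes the target,
the old "are you sure you want to delete this directory?" pattern.
"""
def confirm_destructive(action: str, target: str) -> bool:
    """Require the operator to retype the target before proceeding."""
    print(f"About to {action}: {target}")
    typed = input("Retype the target name to confirm, or press Enter to abort: ")
    return typed == target

def remove_directory(path: str) -> None:
    if not confirm_destructive("recursively delete", path):
        print("Aborted: confirmation did not match.")
        return
    print(f"(would delete {path} here)")   # real deletion intentionally omitted

if __name__ == "__main__":
    remove_directory("/var/data/billing")
```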
Design is organic, or at least should be. Lessons get learned over time and through experience. Yes, it sounds like they were a bit lax since they haven't had to restart the affected subsystems "in many years", and they learned this lesson the hard way.
What is bad design is consumer-facing services that rely 100% on S3 without any fall-back systems. There were clearly many of those.
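As a rough sketch of the kind of fallback being described, with hypothetical backends, a service can try its primary object store first and walk down a list of alternatives instead of failing outright:

```python
"""Sketch of a consumer-facing fallback chain, with hypothetical sources.

The service tries its primary object store first, then a replica in another
region, then a local cache, instead of failing when one backend is down.
"""
from typing import Callable, Optional

def fetch_with_fallback(key: str, sources: list[Callable[[str], Optional[bytes]]]) -> Optional[bytes]:
    """Try each source in order and return the first successful result."""
    for source in sources:
        try:
            data = source(key)
            if data is not None:
                return data
        except Exception:
            continue   # a failing backend should not take the whole page down
    return None

# Illustrative stand-ins for a primary store, a secondary-region replica, and a cache.
def primary_store(key: str) -> Optional[bytes]:
    raise ConnectionError("primary region unavailable")

def secondary_store(key: str) -> Optional[bytes]:
    return b"asset bytes from the replica"

def local_cache(key: str) -> Optional[bytes]:
    return None

if __name__ == "__main__":
    print(fetch_with_fallback("img/logo.png", [primary_store, secondary_store, local_cache]))
```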
I'd call that purely subjective. I've used Apple's cloud to keep my most important data (emails, contacts, passwords and bookmarks) safe for 15 years, as they transitioned from iTools > .mac > MobileMe > iCloud. Has it always been perfect? No, but none of my data has been lost over those years and on many different Macs.
Today, I've got many documents happily stored in iCloud across many applications.
It works for me.