Human error caused Amazon Web Services outage, Apple iCloud service issues

Tuesday's major Amazon Web Services outage was caused by human error, the retailer has confirmed. The downtime, which affected a number of online services including Apple's, was traced back to a single incorrectly entered command issued during debugging.




The note to customers about the S3 (Simple Storage Service) disruption in the US-East-1 region says the team was working on an issue that caused the S3 billing system to run slower than expected. One team member executed a command from an "established playbook" to take down a small number of servers used for a subsystem in the billing process, but mistakenly took down more than required.

"Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended," the Amazon note states.

The extra servers were used to support two other S3 subsystems, one being the "index subsystem" used to manage metadata and location information for all S3 objects in a region, required for the service to perform data storage and management tasks. The second "placement subsystem" relied on the index subsystem in order to function, and is used to allocate storage for new data.

Taking down enough servers in both of these subsystems caused a drop in capacity, forcing the team to restart all of the systems. During this restart period, S3 was unable to service requests, which also affected other AWS services in the region, including Amazon's Elastic Compute Cloud (EC2), Elastic Block Store (EBS) volumes, AWS Lambda, and the S3 console.

S3's subsystems are said by Amazon to be "designed to support the removal or failure of significant capacity with little or no customer impact," built with the assumption that systems will fail and can be replaced by another. Amazon notes there has not been a complete restart of the index subsystem for "many years," and that the massive growth of AWS caused the process of restarting the services and running safety checks to take "longer than expected."

To prevent such a mistake from affecting so many services as profoundly again, the tool has been modified to remove capacity more slowly, with added safeguards that will maintain the minimum required capacity level for each subsystem. Other operational tools will also undergo auditing to ensure they have similar checks in place.
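
The kind of safeguard described here can be illustrated with a minimal sketch. This is not Amazon's actual tool: the function name `plan_removal`, its parameters `min_capacity` and `max_per_step`, and the numbers in the usage note are all hypothetical, chosen only to show a check that refuses to drop below a minimum capacity and removes servers in small batches.

```python
def plan_removal(active, requested, min_capacity, max_per_step):
    """Return batch sizes for removing servers, or raise if unsafe.

    Hypothetical sketch: rejects any request that would leave the
    subsystem below its minimum capacity, and splits an accepted
    request into small batches so capacity is removed slowly.
    """
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step > 0")
    if active - requested < min_capacity:
        raise ValueError(
            f"refusing: removing {requested} of {active} servers "
            f"would leave fewer than the required {min_capacity}"
        )
    batches = []
    remaining = requested
    while remaining > 0:
        step = min(max_per_step, remaining)  # never exceed the per-step limit
        batches.append(step)
        remaining -= step
    return batches
```

With these made-up numbers, `plan_removal(100, 7, 80, 3)` yields three small batches, while a mistyped request such as `plan_removal(100, 30, 80, 3)` is rejected before any server is touched.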

Additionally, work is being carried out on the index subsystem to repartition it, dividing it down into smaller sections to speed up the recovery time.

The Service Health Dashboard, a page that shows AWS users the status of services, failed to show that there was an issue during the downtime, as it relied on S3 in order to function and couldn't update. Amazon is updating the dashboard so that it runs across multiple AWS regions and does not depend on any single region.
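
As a rough illustration of avoiding a single-region dependency, the sketch below tries several status sources in turn and uses the first one that answers. Everything in it is an assumption made for the example: `read_status` is a hypothetical helper, and the per-region callables stand in for real network requests. It is not how Amazon's dashboard is built.

```python
def read_status(fetchers):
    """Try each region's status source in order; return the first that answers.

    `fetchers` maps a region name to a zero-argument callable. The
    callables are hypothetical stand-ins for real network requests.
    """
    failed = []
    for region, fetch in fetchers.items():
        try:
            return region, fetch()
        except Exception:
            # This region is unavailable; fall through to the next one.
            failed.append(region)
    raise RuntimeError(f"no region could serve status: {failed}")
```

If the first region's callable raises an error, the status is read from the next region in the mapping instead of the page going stale.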

Amazon ends the note by apologizing for the impact of the event on its customers. "While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their business."

"We will do everything we can to learn from this event and use it to improve our availability even further."

The outage caused a number of websites that relied on S3 to suffer issues, as well as a number of apps that used Amazon's cloud servers for their services. Apple customers were also affected by the outage, with some users of the iOS and Mac App Stores, iCloud Drive, Notes, iCloud backup, Apple TV, and Apple Music encountering issues during the downtime.

Apple is believed to be making progress moving away from relying on Amazon for its cloud services, by creating its own data centers instead. Apple's Mesa facility is being turned into a "global command center," with the company working to establish new data centers in Ireland and Denmark.

Apple's existing Reno data center, handling Siri, FaceTime, and iMessage among other tasks, may increase its size in the future. It was recently reported Apple is planning to expand the data center by over 375,000 square feet, at a cost of around $50.7 million.

Comments

  • Reply 1 of 42
    NY1822 Posts: 621 member
    for some reason this made me think autonomous cars can't come soon enough...imagine all the human error that can go wrong getting behind the wheel at 50 mph
  • Reply 2 of 42
    Am I the only one who STILL can't log into iCloud from this outage?
  • Reply 3 of 42
    maestro64 Posts: 5,043 member
    Why is their system still using command line prompts? This is the reason Unix is not favored by your average IT worker: one wrong syntax error and you delete everything. Sounds like Amazon may have gotten off easy on this one.
  • Reply 4 of 42
    macxpress Posts: 5,801 member
    lekowsky5 said:
    Am I the only one who STILL can't log into iCloud from this outage?
    Yup...
  • Reply 5 of 42
    MplsP Posts: 3,911 member
    NY1822 said:
    for some reason this made me think autonomous cars can't come soon enough...imagine all the human error that can go wrong getting behind the wheel at 50 mph
    Except when some sleep deprived, slightly hungover coder at Ford makes a mistake it causes 100,000 cars to crash instead of one...
  • Reply 6 of 42
    macxpress Posts: 5,801 member

    maestro64 said:
    Why is their system still using command line prompts? This is the reason Unix is not favored by your average IT worker: one wrong syntax error and you delete everything. Sounds like Amazon may have gotten off easy on this one.
    Umm....a lot of things today are still done with command line prompts. You can be a lot more efficient with the command line (running scripts, etc) than with a GUI. If you've ever logged into an enterprise grade Cisco switch you don't do it with a GUI, you do it with a command line interface...sometimes using a DB-9 cable. 
  • Reply 7 of 42
    sog35 said:
    this is bad design by Amazon
    Is it just in your nature to comment on things you know nothing about? Or is it a disease?
  • Reply 8 of 42
    sirlance99 Posts: 1,293 member
    sog35 said:
    this is bad design by Amazon
    Oh shut up. They run the number one cloud service by far. More than the next several combined. Every single major company, including Apple, uses them. What they do and how they do it while keeping everything up and running is a major feat that no one can compare to.
  • Reply 9 of 42
    fallenjt Posts: 4,053 member
    sog35 said:
    this is bad design by Amazon
    Just like the Oscar night!
  • Reply 10 of 42
    sirlance99 Posts: 1,293 member
    sog35 said:
    but,but,but, but, Apple's cloud sucks...
    It does
  • Reply 11 of 42
    dee_dee Posts: 110 member
    sog35 said:
    this is bad design by Amazon
    Is it just in your nature to comment on things you know nothing about? Or is it a disease?
    Let's get this straight.  An employee types in a wrong command, and Amazon's world wide operations go down, and you seem to think this is not bad design?  Please do enlighten us "Mr. I'm so smart" ?
  • Reply 12 of 42
    fallenjt Posts: 4,053 member
    macxpress said:

    maestro64 said:
    Why is their system still using command line prompts? This is the reason Unix is not favored by your average IT worker: one wrong syntax error and you delete everything. Sounds like Amazon may have gotten off easy on this one.
    Umm....a lot of things today are still done with command line prompts. You can be a lot more efficient with the command line (running scripts, etc) than with a GUI. If you've ever logged into an enterprise grade Cisco switch you don't do it with a GUI, you do it with a command line interface...sometimes using a DB-9 cable. 
    You don't get such a stupid mistake like this with a GUI. Command lines are so 20th century.
  • Reply 13 of 42
    slurpy Posts: 5,382 member
    sog35 said:
    this is bad design by Amazon
    Is it just in your nature to comment on things you know nothing about? Or is it a disease?
    Going by 100% of sog's posts, I'd say all of the above.
  • Reply 14 of 42
    lkrupp Posts: 10,557 member
    fallenjt said:
    macxpress said:

    maestro64 said:
    Why is their system still using command line prompts? This is the reason Unix is not favored by your average IT worker: one wrong syntax error and you delete everything. Sounds like Amazon may have gotten off easy on this one.
    Umm....a lot of things today are still done with command line prompts. You can be a lot more efficient with the command line (running scripts, etc) than with a GUI. If you've ever logged into an enterprise grade Cisco switch you don't do it with a GUI, you do it with a command line interface...sometimes using a DB-9 cable. 
    You don't get such a stupid mistake like this with a GUI. Command lines are so 20th century.
    Pfft! Yes you can. I got turned around inside a multiplexer and took down 12 T3s before I realized what I’d done. FCC reportable outage, big fine for AT&T. I almost got three days off without pay. Only the second level manager saved my ass that day. And I was using a point and click GUI to rearrange digital cross-connects. Stupid is as stupid does.
    edited March 2017
  • Reply 15 of 42
    volcan Posts: 1,799 member
    fallenjt said:
    You don't get such a stupid mistake like this with a GUI. Command lines are so 20th century.
    Depending on who codes the GUI, it can be very vague as to what will happen when you are about to click on something, and functions are hidden all over the place in different screens. Case in point: try to navigate Network Solutions' GUI configuration tools. A complete nightmare. I've been using the UNIX command line for internet servers for more than 20 years and it is by far more powerful and faster, with fewer mistakes, than any GUI I have ever used. On the command line, simple typos usually cause the command to be rejected. Low level programming is really a trust issue. Someone needs to code everything, even GUIs. Mistakes happen, and the more complex the task, the greater the risk.
  • Reply 16 of 42
    maestro64 Posts: 5,043 member
    lkrupp said:
    fallenjt said:
    macxpress said:

    maestro64 said:
    Why is their system still using command line prompts? This is the reason Unix is not favored by your average IT worker: one wrong syntax error and you delete everything. Sounds like Amazon may have gotten off easy on this one.
    Umm....a lot of things today are still done with command line prompts. You can be a lot more efficient with the command line (running scripts, etc) than with a GUI. If you've ever logged into an enterprise grade Cisco switch you don't do it with a GUI, you do it with a command line interface...sometimes using a DB-9 cable. 
    You don't get such a stupid mistake like this with a GUI. Command lines are so 20th century.
    Pfft! Yes you can. I got turned around inside a multiplexer and took down 12 T3s before I realized what I’d done. FCC reportable outage, big fine for AT&T. I almost got three days off without pay. Only the second level manager saved my ass that day. And I was using a point and click GUI to rearrange digital cross-connects. Stupid is as stupid does.


    Yep, sounds like a failure of the GUI: it did not have an error-checking system to verify that what you were doing was really what you wanted to do. I grew up on mainframe computers with nothing but command line prompts. I also learned on Unix machines and worked on VAX, which had a command line system. Their OS had fail-safes built in: any time you typed a command that could have a really bad effect, like deleting a directory or disabling something, it would come back and ask whether you really wanted to delete the directory or such. It always made you double-check what you were doing. I loved the Mac back in '84 since it would not allow you to wipe out your file system or such, and I used it before working on a mainframe or Unix systems.

    GUI systems are supposed to check and ask the user to verify what they are doing, especially if it would disrupt services. I worked for a networking company after leaving the computer industry, and our system, which competed with Cisco at the time, had a really nice GUI to configure and manage the network switches. It was one of its best-selling features because it would not allow you to disrupt service without a serious override and double checks on what you were doing.

    Yes, humans make mistakes, but a command line system with the ability to bring everything down is not the way you run a multibillion-dollar business.

  • Reply 17 of 42
    coolfactor Posts: 2,239 member
    sog35 said:
    this is bad design by Amazon

    Design is organic, or at least should be. Lessons get learned over time and through experience. Yes, it sounds like they were a bit lax since they haven't had to restart the affected subsystems "in many years", and they learned this lesson the hard way.

    What is bad design is consumer-facing services that rely 100% on S3 without any fall-back systems. There were clearly many of those.
  • Reply 18 of 42
    coolfactor Posts: 2,239 member
    sog35 said:
    but,but,but, but, Apple's cloud sucks...

    I'd call that purely subjective. I've used Apple's cloud to keep my most important data (emails, contacts, passwords and bookmarks) safe for 15 years, as they transitioned from iTools > .mac > MobileMe > iCloud. Has it always been perfect? No, but none of my data has been lost over those years and on many different Macs.

    Today, I've got many documents happily stored in iCloud across many applications.

    It works for me.
  • Reply 19 of 42
    kkerst Posts: 330 member
    lekowsky5 said:
    Am I the only one who STILL can't log into iCloud from this outage?
    No, well, not exactly. I can't login to iCloud Drive from any device - Windows, iOS, or macOS. If it's because of this AWS fat finger, it sucks.
  • Reply 20 of 42
    dee_dee Posts: 110 member
    sog35 said:
    this is bad design by Amazon

    What is bad design is consumer-facing services that rely 100% on S3 without any fall-back systems. There were clearly many of those.
    Are you aware of the engineering involved in mirroring your storage with 2 separate providers? That was a rhetorical question. Of course you aren't.
    edited March 2017