

We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th, 2017. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from an S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.
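To make the dependency structure concrete, here is a minimal sketch in Python, using hypothetical names, of which S3 request types remain servable given the health of the two subsystems. It only illustrates the relationships described above; it is not S3's actual implementation.

    from dataclasses import dataclass

    @dataclass
    class SubsystemHealth:
        index_available: bool       # metadata and location lookups
        placement_available: bool   # allocation of storage for new objects

    def servable_apis(health: SubsystemHealth) -> set[str]:
        apis: set[str] = set()
        if health.index_available:
            # GET, LIST, and DELETE only need object metadata and location.
            apis.update({"GET", "LIST", "DELETE"})
            if health.placement_available:
                # PUT also needs new storage allocated, and the placement
                # subsystem itself depends on a functioning index.
                apis.add("PUT")
        return apis

    # While both subsystems were restarting, no S3 APIs could be served:
    print(sorted(servable_apis(SubsystemHealth(False, False))))  # []
    # With the index back but placement still recovering, everything but PUT works:
    print(sorted(servable_apis(SubsystemHealth(True, False))))   # ['DELETE', 'GET', 'LIST']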
S3 subsystems are designed to support the removal or failure of significant capacity with little or no customer impact. We build our systems with the assumption that things will occasionally fail, and we rely on the ability to remove and replace capacity as one of our core operational processes. While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected. The index subsystem was the first of the two affected subsystems that needed to be restarted. By 12:26PM PST, the index subsystem had activated enough capacity to begin servicing S3 GET, LIST, and DELETE requests. By 1:18PM PST, the index subsystem was fully recovered and GET, LIST, and DELETE APIs were functioning normally. The S3 PUT API also required the placement subsystem. The placement subsystem began recovery when the index subsystem was functional and finished recovery at 1:54PM PST. At this point, S3 was operating normally. Other AWS services that were impacted by this event began recovering. Some of these services had accumulated a backlog of work during the S3 disruption and required additional time to fully recover.
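The ordering above follows from the dependencies between the subsystems: index capacity has to be activated and validated before GET, LIST, and DELETE can be served, and the placement subsystem (and with it PUT) can only recover once the index is functional. The following is a toy Python sketch of that sequencing, with hypothetical host counts and event names; it is not S3's actual recovery tooling.

    def recover(index_hosts: int, min_serving_capacity: int) -> list[str]:
        # Toy recovery sequence; the numbers and steps are illustrative only.
        events: list[str] = []
        validated = 0
        for _ in range(index_hosts):
            # Each index host is brought back and its metadata is checked
            # before it may serve traffic; at S3's scale these safety checks
            # are what made the restart take longer than expected.
            validated += 1
            if validated == min_serving_capacity:
                events.append("enough index capacity validated: GET, LIST, DELETE resume")
        events.append("index subsystem fully recovered")
        # Placement depends on a functioning index, so it recovers last,
        # and PUT is the final API to come back.
        events.append("placement subsystem recovered: PUT resumes")
        return events

    for event in recover(index_hosts=8, min_serving_capacity=5):
        print(event)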

We are making several changes as a result of this operational event. While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level. This will prevent an incorrect input from triggering a similar event in the future. We are also auditing our other operational tools to ensure we have similar safety checks. We will also make changes to improve the recovery time of key S3 subsystems. We employ multiple techniques to allow our services to recover from any failure quickly. One of the most important involves breaking services into small partitions which we call cells. By factoring services into cells, engineering teams can assess and thoroughly test recovery processes of even the largest service or subsystem. As S3 has scaled, the team has done considerable work to refactor parts of the service into smaller cells to reduce blast radius and improve recovery. During this event, the recovery time of the index subsystem still took longer than we expected.
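As an illustration of the kind of safeguard described above, the sketch below (in Python, with hypothetical names and thresholds; it is not the actual S3 operational tool) validates a capacity-removal request against a minimum-capacity floor and forces the removal to proceed in small batches rather than all at once.

    class CapacityRemovalError(Exception):
        pass

    def plan_removal(active_hosts: int, requested: int,
                     min_required: int, max_per_batch: int) -> list[int]:
        # Return a list of batch sizes that drain `requested` hosts safely.
        if requested < 0:
            raise CapacityRemovalError("requested removal must be non-negative")
        if active_hosts - requested < min_required:
            # Safeguard: an incorrectly entered input cannot push a subsystem
            # below the capacity it needs to keep serving requests.
            raise CapacityRemovalError(
                f"removing {requested} of {active_hosts} hosts would drop below "
                f"the minimum required capacity of {min_required}"
            )
        # Safeguard: remove capacity slowly, in small batches.
        batches: list[int] = []
        remaining = requested
        while remaining > 0:
            step = min(max_per_batch, remaining)
            batches.append(step)
            remaining -= step
        return batches

    # A mistyped request for 50 hosts instead of 5 is rejected up front:
    try:
        plan_removal(active_hosts=60, requested=50, min_required=40, max_per_batch=2)
    except CapacityRemovalError as error:
        print("rejected:", error)

    # The intended removal still proceeds, but only a little at a time:
    print(plan_removal(active_hosts=60, requested=5, min_required=40, max_per_batch=2))
    # -> [2, 2, 1]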
