Outages Extension
The LocalStack Outages Extension can simulate outages for any AWS region or service. You can install and use it through the LocalStack Extensions mechanism. Intentionally causing service outages and observing how the system recovers while part of the infrastructure is unavailable is an effective way to test infrastructure resilience. This method evaluates the system's deployment mechanisms and its ability to handle and recover from infrastructure anomalies, a critical aspect of chaos engineering.
Outages Extension is currently available as part of the LocalStack Enterprise plan. If you'd like to try it out, please contact us to request access.
Getting started
This guide is designed for users who are new to the Outages Extension. We'll simulate partial outages by interrupting specific services, such as halting the creation of an ECS instance or disrupting a database service. By closely watching Terraform's responses and the status of AWS resources, you'll learn how Terraform manages these disruptions.
For this particular example, we'll be using a Terraform configuration file from a sample application repository. Clone the repository, and follow the instructions below to get started.
Prerequisites
The general prerequisites for this guide are:
- LocalStack Pro with LocalStack CLI & LocalStack Auth Token
- AWS CLI with the awslocal wrapper
- Docker and Docker Compose
- Terraform and the tflocal wrapper
Start LocalStack by using the docker-compose.yml file from the repository. Make sure to set your Auth Token as an environment variable during this process.
$ export LOCALSTACK_AUTH_TOKEN=<YOUR_LOCALSTACK_AUTH_TOKEN>
$ docker compose up
Installing the extension
To install the LocalStack Outages Extension, first set up your LocalStack Auth Token in your environment. Once the token is configured, use the command below to install the extension:
$ localstack extensions install localstack-extension-outages
Alternatively, you can enable automatic installation of the extension by setting the environment variable EXTENSION_AUTO_INSTALL=localstack-extension-outages when you start the LocalStack container. You can do this by passing it on the docker command line or by adding it as an environment variable in your docker-compose configuration.
Follow our Managing Extensions documentation for more information on how to install & manage extensions.
Running Terraform
To get started, initialize & apply the Terraform configuration using the tflocal CLI to create the local resources. The Terraform configuration file operates independently of the application, meaning the application won't be available during this phase. To deploy the entire stack, including the application, refer to the sample repository.
$ tflocal init
$ tflocal plan
$ tflocal apply
You should see output similar to the following:
Apply complete! Resources: 57 added, 0 changed, 0 destroyed.
Outputs:
api_id = "3eed6d1d"
api_invoke_url = "https://3eed6d1d.execute-api.us-east-1.amazonaws.com"
api_invoke_url_foodstore_foods = "https://3eed6d1d.execute-api.us-east-1.amazonaws.com/foodstore/foods/{foodId}"
api_invoke_url_petstore_pets = "https://3eed6d1d.execute-api.us-east-1.amazonaws.com/petstore/domestic/pets/{petId}"
api_test_page = <sensitive>
container_security_group = "sg-db749514a062de41c"
ecs_cluster_name = "arn:aws:ecs:us-east-1:000000000000:cluster/ecs-cluster"
private_dns_namespace = "60bfac90"
vpc_id = "vpc-f9d6b124"
Next, you can update certain resources. This includes increasing the number of tasks in the task_definition for the ECS service from 3 to 5 and upgrading the openapi specification version used by API Gateway from 3.0.1 to 3.1.0.
Simulating outages
After running the Terraform plan command to preview these changes, you can simulate an outage affecting the ECS and API Gateway V2 services before applying the changes. To do this, execute the following command:
$ curl --location --request POST 'http://outages.localhost.localstack.cloud:4566/outages' \
--header 'Content-Type: application/json' \
--data-raw '[
{
"service": "ecs",
"region": "us-east-1"
},
{
"service": "apigatewayv2",
"region": "us-east-1"
}
]'
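The same request can be issued from a script, which is handy when an outage has to be toggled from a test suite. The following is a minimal sketch using only the Python standard library; the endpoint and payload come from the curl example above, while the helper names `build_outage_request` and `send` are hypothetical and not part of the extension.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
OUTAGES_URL = "http://outages.localhost.localstack.cloud:4566/outages"


def build_outage_request(rules, method="POST", url=OUTAGES_URL):
    """Build (but do not send) a request carrying a list of outage rules."""
    body = json.dumps(rules).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method=method,
        headers={"Content-Type": "application/json"},
    )


def send(request, timeout=5):
    """Send the request; needs a running LocalStack with the extension installed."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")


# Usage (requires LocalStack to be running):
# rules = [
#     {"service": "ecs", "region": "us-east-1"},
#     {"service": "apigatewayv2", "region": "us-east-1"},
# ]
# send(build_outage_request(rules))
```

Keeping request construction separate from sending makes it possible to check the payload without a running LocalStack container.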
In the LocalStack logs, you'll notice that, in between successful calls, the controlled outages show up as a ServiceUnavailableException accompanied by a 503 HTTP status code. These exceptions affect only the targeted AWS APIs.
2023-11-09T21:53:31.801 INFO --- [ asgi_gw_9] localstack.request.aws : AWS ec2.GetTransitGatewayRouteTableAssociations => 200
2023-11-09T21:53:31.824 INFO --- [ asgi_gw_2] localstack.request.aws : AWS apigatewayv2.GetVpcLink => 503 (ServiceUnavailableException)
2023-11-09T21:53:31.828 INFO --- [ asgi_gw_6] localstack.request.aws : AWS servicediscovery.ListTagsForResource => 200
2023-11-09T21:53:31.831 INFO --- [ asgi_gw_8] localstack.request.aws : AWS ec2.DescribeRouteTables => 200
2023-11-09T21:53:31.834 INFO --- [ asgi_gw_7] localstack.request.aws : AWS servicediscovery.ListTagsForResource => 200
2023-11-09T21:53:31.836 INFO --- [ asgi_gw_0] localstack.request.aws : AWS ec2.DescribePrefixLists => 200
2023-11-09T21:53:31.842 INFO --- [ asgi_gw_1] localstack.request.aws : AWS ec2.DescribeSecurityGroups => 200
2023-11-09T21:53:31.848 INFO --- [ asgi_gw_6] localstack.request.aws : AWS ec2.GetTransitGatewayRouteTablePropagations => 200
2023-11-09T21:53:31.876 INFO --- [ asgi_gw_9] localstack.request.aws : AWS ec2.DescribeRouteTables => 200
2023-11-09T21:53:31.879 INFO --- [ asgi_gw_5] localstack.request.aws : AWS ec2.DescribeRouteTables => 200
2023-11-09T21:53:32.205 INFO --- [ asgi_gw_8] localstack.request.aws : AWS ecs.DescribeClusters => 503 (ServiceUnavailableException)
2023-11-09T21:53:32.280 INFO --- [ asgi_gw_3] localstack.request.aws : AWS ecs.DescribeTaskDefinition => 503 (ServiceUnavailableException)
2023-11-09T21:53:32.443 INFO --- [ asgi_gw_0] localstack.request.aws : AWS ecs.DescribeTaskDefinition => 503 (ServiceUnavailableException)
2023-11-09T21:53:32.584 INFO --- [ asgi_gw_6] localstack.request.aws : AWS apigatewayv2.GetVpcLink => 503 (ServiceUnavailableException)
2023-11-09T21:53:33.271 INFO --- [ asgi_gw_9] localstack.request.aws : AWS ecs.DescribeClusters => 503 (ServiceUnavailableException)
2023-11-09T21:53:33.473 INFO --- [ asgi_gw_2] localstack.request.aws : AWS ecs.DescribeTaskDefinition => 503 (ServiceUnavailableException)
2023-11-09T21:53:33.889 INFO --- [ asgi_gw_7] localstack.request.aws : AWS ecs.DescribeTaskDefinition => 503 (ServiceUnavailableException)
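To confirm that only the targeted services were affected, you can summarize such a log. The sketch below assumes log lines shaped like the ones above ("AWS service.Operation => status"); the function name `count_unavailable` is hypothetical, and the pattern may need adjusting for other LocalStack versions or log formats.

```python
import re
from collections import Counter

# Matches the tail of a LocalStack request log line as shown above,
# e.g. "AWS ecs.DescribeClusters => 503 (ServiceUnavailableException)".
LOG_PATTERN = re.compile(
    r"AWS (?P<service>[\w-]+)\.(?P<operation>\w+) => (?P<status>\d{3})"
)


def count_unavailable(log_lines):
    """Count 503 responses per service in an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match and match.group("status") == "503":
            counts[match.group("service")] += 1
    return counts
```

Passing an open log file to `count_unavailable` yields a per-service tally of failed calls, which you can compare against the outage rules you configured.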
During infrastructure provisioning, depending on the tool and provider used, the tool may retry applying changes to a resource after a failure, or the action may simply fail.
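The retry behavior described here can be illustrated with a small, generic sketch. It is not how Terraform or any particular provider implements retries; the `ServiceUnavailable` exception and the `call_with_retries` helper are hypothetical stand-ins for a client that surfaces the 503 response as an exception.

```python
import time


class ServiceUnavailable(Exception):
    """Hypothetical stand-in for a 503 ServiceUnavailableException."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on ServiceUnavailable, wait and retry with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

If the outage ends before the attempts run out, the call eventually succeeds; otherwise the exception propagates, which corresponds to the "simply fail" case.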
Simulating shutdowns
To simulate the shutdown of an entire region, execute the following command:
$ curl --location --request POST 'http://outages.localhost.localstack.cloud:4566/outages' \
--header 'Content-Type: application/json' \
--data-raw '[
{
"service": "*",
"region": "us-east-1"
}
]'
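To reason about which calls a given configuration should affect, the rule format can be modeled locally. This sketch only reflects what the examples above show, namely that "*" in the "service" field covers every service in the region; it is an assumption-based model with hypothetical function names, not the extension's own matching logic.

```python
def rule_matches(rule, service, region):
    """Return True if an outage rule covers the given service and region.

    Assumption: "*" in the "service" field matches every service, as in the
    region shutdown example; no other wildcard behavior is modeled here.
    """
    service_ok = rule["service"] in ("*", service)
    return service_ok and rule["region"] == region


def is_affected(rules, service, region):
    """Return True if any rule in the list covers the service/region pair."""
    return any(rule_matches(rule, service, region) for rule in rules)
```

With the region shutdown rule above, `is_affected` reports every service in us-east-1 as affected and leaves other regions untouched.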
Other operations
To stop outages, submit an empty list in the configuration using the following POST request:
$ curl --location --request POST 'http://outages.localhost.localstack.cloud:4566/outages' \
--header 'Content-Type: application/json' \
--data-raw '[]'
To view the current configuration, use this GET request:
$ curl --location --request GET 'http://outages.localhost.localstack.cloud:4566/outages'
To add a new service/region rule to the configuration, use a PATCH request as shown below:
$ curl --location --request PATCH 'http://outages.localhost.localstack.cloud:4566/outages' \
--header 'Content-Type: application/json' \
--data-raw '[{"service": "transcribe", "region": "us-west-1"}]'
To remove a service/region rule from the configuration, execute a DELETE request as follows:
$ curl --location --request DELETE 'http://outages.localhost.localstack.cloud:4566/outages' \
--header 'Content-Type: application/json' \
--data-raw '[{"service": "transcribe", "region": "us-west-1"}]'
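The four operations can be summarized with a small in-memory model. This is a sketch of one reading of the requests above (POST sets the whole list, PATCH adds rules, DELETE removes them, GET returns the list); the `OutageConfig` class is hypothetical, and skipping duplicates on PATCH is an assumption rather than documented behavior.

```python
class OutageConfig:
    """Local model of the outage rule list; not the extension's implementation."""

    def __init__(self):
        self.rules = []

    def post(self, rules):
        """Replace the whole configuration (an empty list stops all outages)."""
        self.rules = [dict(rule) for rule in rules]

    def get(self):
        """Return a copy of the current configuration."""
        return [dict(rule) for rule in self.rules]

    def patch(self, rules):
        """Add rules; skipping rules already present is an assumption."""
        for rule in rules:
            if rule not in self.rules:
                self.rules.append(dict(rule))

    def delete(self, rules):
        """Remove every rule that appears in the given list."""
        self.rules = [rule for rule in self.rules if rule not in rules]
```

A model like this can serve as a test double when you want to unit test outage scenarios without starting LocalStack.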
Conclusion
By closely watching Terraform's responses and the status of cloud resources, you'll learn how Terraform manages these disruptions. It's important to note how it attempts to retry operations, whether it rolls back changes or faces partial failures, and how it logs these incidents.
This is crucial for understanding the resilience of your infrastructure provisioning against challenging conditions. It also aids in enhancing your IaC configurations, ensuring they are more robust and effective in handling faults and errors in real-life situations.