In this tutorial, we will use a simple application composed of 3 microservices to understand how to use Gremlin to systematically inject a failure and test whether the microservices behave in the expected manner during the failure. Specifically, we will validate whether the microservices implement stability patterns to handle the failure. Compared with simply injecting faults (killing VMs or containers, or failing requests randomly), one of the main advantages of this systematic approach is that it gives the tester a good idea of where things might be going wrong. The tester can then quickly fix the service, rebuild, redeploy, and test again.
This example, while contrived, serves to illustrate the benefits of systematically testing your microservices application for failure recovery instead of randomly injecting failures without any useful validation.
- Docker and docker-compose
- REST client (curl, Chrome + Postman, etc.)
This tutorial assumes that you are using Chrome + Postman to make REST API calls to our sample application.
- Set up the Gremlin Python SDK
vagrant@vagrant-ubuntu-trusty-64:~$ git clone https://github.com/ResilienceTesting/gremlinsdk-python
vagrant@vagrant-ubuntu-trusty-64:~$ cd gremlinsdk-python/python
vagrant@vagrant-ubuntu-trusty-64:~/gremlinsdk-python/python$ sudo python setup.py install
- Set up the simple microservice application
For trying out some of the recipes, we will be using a simple bookinfo application made of three microservices and an API gateway service (gateway) facing the user. The API gateway calls the productpage microservice, which in turn relies on the details microservice for ISBN info and the reviews microservice for editorial reviews. The SDK contains all the code necessary to build the Docker containers for each microservice. The application is written using Python's Flask framework.
Let's first build the Docker images for each microservice.
cd gremlinsdk-python/exampleapp; ./build-apps.sh
The Docker images for the API gateway and the productpage service have the gremlinproxy embedded inside them as a sidecar process. The microservices are connected to each other using Docker links. The entire application can be launched using docker-compose. In the real world, the microservices would register themselves with a service registry (e.g., Consul, Etcd, Zookeeper) and use a service proxy (i.e., the dependency injection pattern) to dynamically discover the locations of other services and invoke their APIs. The gremlinproxy provided in this example is a simple reference implementation of a service proxy that relies on a static configuration file to indicate the location of other microservices.
In addition to the 4 services of the Bookinfo app (the gateway and the three microservices), there is a Logstash forwarder and an Elasticsearch container. Event logs from the Gremlin proxies are forwarded by the Logstash forwarder to the Elasticsearch server. The Gremlin SDK queries this Elasticsearch server during the behavior validation phase.
cd gremlinsdk-python/exampleapp; ./runapps.sh
Open Postman and access the URL http://localhost:9080/productpage to make sure the page is up.
Let's run a very simple Gremlin recipe that fakes the overload of the reviews service (without needing to crash the service) and checks whether the productpage service handles this scenario using the timeout pattern. The figure below illustrates the failure scenario and shows both the expected and the unexpected behavior of the application. As noted earlier, this is a very contrived example meant for the purpose of illustration. In the real world, you would use a circuit breaker pattern to recover from such failures.
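For reference, the circuit breaker pattern mentioned above can be sketched as follows. This is a generic illustration and not part of the sample app or the Gremlin SDK; the class name, the thresholds, and the injectable clock are all assumptions made for this example.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: fail fast after repeated failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to stay open before retrying
        self.clock = clock              # injectable so the logic can be tested
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the remote call until the reset window has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0               # a success closes the circuit again
        return result
```

Unlike a plain timeout, the breaker stops calling the overloaded service altogether for a while, so callers get the fallback immediately instead of waiting for each request to time out.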
While it is possible to express Gremlin recipes purely in Python code, for
the purpose of this tutorial, we will be using a simple generic test
harness (gremlinsdk-python/exampleapp/recipes/run_recipe_json.py) that takes as input
three JSON files: the application's dependency graph, the failure scenario
and the assertions. You will find the following three JSON files in the
gremlinsdk-python/exampleapp/recipes folder:
- topology.json describes the application topology for the bookinfo application that we set up earlier.
- gremlins.json describes the failure scenario, wherein the reviews service is overloaded. A symptom of this scenario is extremely delayed responses from the reviews service. In our case, responses will be delayed by 8 seconds.
  Scoping failures to synthetic users: As we are doing this test in production, we don't want to affect real users with our failure tests, so let's restrict the failures to a set of synthetic requests. We distinguish synthetic requests using a special HTTP header, X-Gremlin-ID. Only requests carrying this header will be subjected to fault injection. Since multiple tests could be running simultaneously, we distinguish our test using a specific header value, testUser-timeout-*. So any request from productpage to reviews that contains the HTTP header X-Gremlin-ID: testUser-timeout-<someval> will be subjected to the overload failure described in this JSON file.
- checklist.json describes the list of behaviors we want to validate during such a scenario. In our case, we will check whether the productpage service times out its API call to the reviews service and responds to the gateway service within 100ms. This behavior is termed bounded_response_time in the checklist.json file.
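The header-based scoping described above can be sketched in a few lines of Python. This only illustrates the matching idea and is not the gremlinproxy implementation; the helper name `is_test_request` and the use of shell-style wildcard matching are assumptions.

```python
from fnmatch import fnmatchcase

GREMLIN_HEADER = "X-Gremlin-ID"


def is_test_request(headers, pattern="testUser-timeout-*"):
    """Return True if the request carries a matching X-Gremlin-ID header.

    Hypothetical helper: `headers` is a plain dict of HTTP header names
    to values, and `pattern` is a shell-style wildcard pattern.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    value = normalized.get(GREMLIN_HEADER.lower())
    return value is not None and fnmatchcase(value, pattern)
```

With this rule, a request tagged `X-Gremlin-ID: testUser-timeout-1` is selected for fault injection, while untagged requests and requests tagged for a different test pass through untouched.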
Let's run the recipe.
cd gremlinsdk-python/exampleapp/recipes; ./run_recipe_json.py topology.json gremlins.json checklist.json
You should see the following output:
Use postman to inject test requests,
with HTTP header X-Gremlin-ID: <header-value>
press Enter key to continue to validation phase
Note: Realistically, load injection would be performed as part of the test script. However, for the purposes of this tutorial, let's manually inject the load into the application so that we can visually observe the impact of fault injection and failure handling.
Go back to Postman. Add X-Gremlin-ID to the header field and set
testUser-timeout-1 as the value for the header.
Load the page (http://localhost:9080/productpage) and you should see that the page takes more than 8 seconds to load.
This page load is an example of poor handling of the failure scenario. The reviews service was overloaded and took a long time to respond. The productpage service, which depends on the reviews service, did not time out its API call.
Now, disable the header field in Postman and reload the page. You should
see that the web page loads in less than 100ms without
X-Gremlin-ID. In other words, normal traffic remains unaffected,
while only "tagged" test traffic carrying the X-Gremlin-ID header is
subjected to failure injection.
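In an automated run, the manual Postman steps above would be replaced by scripted requests. The following is a minimal sketch using only the standard library; `timed_get`, `inject_load`, and the `opener` parameter are hypothetical names introduced here, and the URL in the usage note is the one used earlier in this tutorial.

```python
import time
import urllib.request


def timed_get(url, headers=None, timeout=30.0, opener=urllib.request.urlopen):
    """Fetch url once and return the elapsed wall-clock time in seconds."""
    request = urllib.request.Request(url, headers=headers or {})
    start = time.monotonic()
    with opener(request, timeout=timeout) as response:
        response.read()  # include the body transfer in the measurement
    return time.monotonic() - start


def inject_load(url, gremlin_id, count=5, **kwargs):
    """Send `count` tagged requests and return their latencies in seconds."""
    headers = {"X-Gremlin-ID": gremlin_id}
    return [timed_get(url, headers, **kwargs) for _ in range(count)]
```

For example, `inject_load("http://localhost:9080/productpage", "testUser-timeout-1")` sends five tagged requests, while `timed_get("http://localhost:9080/productpage")` times a single untagged one for comparison.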
Go back to the console and complete the recipe execution, i.e., run the behavior validation step.
Hit the Enter key on the console.
The validation code parses the log entries from gremlin service proxies to
check if the productpage service loaded in less than 100ms for requests
containing X-Gremlin-ID. You should see the following output on the
console:
Check bounded_response_time productpage FAIL
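Conceptually, this check amounts to filtering the proxy logs for the test's X-Gremlin-ID and comparing each response time with the bound. The sketch below shows that idea on in-memory records. The record keys (`dest`, `gremlin_id`, `duration_ms`) and the function name are hypothetical; they do not reflect the SDK's actual log schema or its Elasticsearch queries.

```python
from fnmatch import fnmatchcase


def check_bounded_response_time(records, dest, id_pattern, max_latency_ms):
    """Return True if every matching response from `dest` met the bound.

    Each record is a dict with hypothetical keys: 'dest' (the service that
    was called), 'gremlin_id' (the X-Gremlin-ID value, or None) and
    'duration_ms' (the observed response time in milliseconds).
    """
    matching = [
        r for r in records
        if r["dest"] == dest
        and fnmatchcase(r.get("gremlin_id") or "", id_pattern)
    ]
    # With no matching traffic there is nothing to validate, so report a
    # failure rather than a vacuous pass.
    if not matching:
        return False
    return all(r["duration_ms"] <= max_latency_ms for r in matching)
```

Treating an empty match as a failure is a design choice for this sketch: it catches the case where no tagged requests were sent before the validation phase.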
Let's fix the buggy productpage service, rebuild and redeploy. We will add a 10ms timeout to API calls made to the reviews service, which keeps the productpage response well within the 100ms bound.
cd gremlinsdk-python/exampleapp
Open productpage/productpage.py in your favorite editor. Go to the getReviews() function.
def getReviews(headers):
    ##timeout is set to 10 milliseconds
    try:
        res = requests.get(reviews['url'], headers=headers) #, timeout=0.010)
    except:
        res = None
    if res and res.status_code == 200:
        return res.text
    else:
        return """<h3>Sorry, product reviews are currently unavailable for this book.</h3>"""
Uncomment the part related to #timeout=0.010 and integrate it into the get API call like below:
res = requests.get(reviews['url'], headers=headers, timeout=0.010)
Save and close the file.
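To experiment with the timeout-and-fallback pattern outside the sample app, the same logic can be written against the standard library. This is a sketch, not the code in productpage.py; the function name `get_reviews`, its `opener` parameter, and the narrower `except OSError` clause are choices made for this illustration.

```python
import urllib.request

FALLBACK_HTML = (
    "<h3>Sorry, product reviews are currently unavailable for this book.</h3>"
)


def get_reviews(url, headers, timeout=0.010, opener=urllib.request.urlopen):
    """Fetch reviews, falling back to a static message on timeout or error."""
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request, timeout=timeout) as response:
            if response.status == 200:
                return response.read().decode("utf-8")
    except OSError:
        # urllib.error.URLError, HTTPError and socket timeouts are all
        # subclasses of OSError, so one clause covers the failure modes.
        pass
    return FALLBACK_HTML
```

Note that the timeout bounds each blocking socket operation rather than the total duration of the request, so it is a guard against an unresponsive peer, not a strict deadline.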
Rebuild the app.
cd gremlinsdk-python/exampleapp; ./rebuild-productpage.sh
Redeploy the app.
cd gremlinsdk-python/exampleapp; ./killall.sh;./runall.sh
Let's rerun the previous Gremlin recipe to check whether the productpage service passes the test criterion. Repeat steps 2, 3 and 4. This time, even if X-Gremlin-ID is present, the product page loads in less than 100ms, and you should see
Sorry, product reviews are currently unavailable for this book.
You should also see the following console output during the behavior validation phase:
Check bounded_response_time productpage PASS
FYI: If you want to re-run the demo, you should revert the application to its
old setup and rebuild the docker containers. The undochanges.sh
helper script automates all of these tasks.
cd gremlinsdk-python/exampleapp; ./undochanges.sh
What we did above was test an app for failure recovery, debug it, fix the issue, redeploy, and test again to ensure that the bug was fixed properly. You could imagine automating the entire testing process above and integrating it into your build pipeline, so that you can run failure recovery tests just like your unit and integration tests.

