Trust is good, control is better: API Contract Testing
Have you ever heard the phrase ‘Everything can be automated’?
I have, and I have often used it myself. Perhaps it’s not always applicable, but the perspective is valuable when there is an automation challenge. However, just because we *can* automate something does not mean we *should* automate it.
In my experience, enterprise organizations often lack automation. For example, requesting an application server through a request process could mean that multiple tasks have to be completed manually by different teams.
Now here comes the pitfall: we start automating the process with whatever technology is available to us (because we can) without understanding the potential consequences.
After implementing the automation, ‘Day Two’ arrives, and our consumers start complaining that the automated process fails. Eventually, we find out that somewhere down the line some external system does not behave as expected.
Which raises the question: what do we expect from an external system beyond its happy-flow behavior? Unless that is clearly described, we cannot rely on it.
I like to think of this as a lack of contracting.
Contracting
To me, ‘contracting’ means that two parties benefit from each other and have an agreement.
One party provides a service or functionality through an interface. The behavior of the interface should be well documented (for example, API resource/method documentation). Also, the availability of the interface, the service itself, and expected resources should be well described.
The other party consumes the service, based on the contract description.
We must expect services to have enough quality and resilience to stay available and to deal with unhappy-flow situations (for example, degradation of infrastructure).
Inherited responsibility
Looking at the example of the “application server request process”, I identify the challenge of “inherited responsibility”.
This means that when an external system fails (which causes the request process to fail) the consumer will likely complain to the owner of the request process service, even though it’s outside of their control.
Resilience
Contracts are often not perfect. We might settle for a compromise, for example:
- The service interface is not always available.
- Service resources might have shared ownership and could be altered outside of our control.
- You might have to retry an operation on service resources before it takes effect.
As long as the compromise is clear and described in the contract, we can build resilience into our system.
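To make the retry compromise concrete, here is a minimal sketch in Python. The name `call_with_retry`, the default of three attempts, and the doubling delay are my own assumptions for illustration, not part of any particular contract; the real values should come from what the contract describes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    initial_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying with a doubling delay between failed attempts.

    Hypothetical helper: the attempt count and delays are illustrative only.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            # Out of attempts: let the caller see the last error.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the behavior can be checked without actually waiting. In a real system you would probably retry only on errors the contract marks as transient, rather than on every exception.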
Contract testing
The trouble begins when external systems are not behaving as expected and become a risk. We will have to deal with it and perhaps mitigate the risk by increasing resilience in our system (retry the request, delay the operation). We also want to be alerted directly when an external service is misbehaving, instead of being alerted by our customers.
My approach to this challenge is *contract testing*.
A contract test is a test that runs periodically (depending on the risk) and engages with the external system just like our service would. When the test fails, we are alerted, and the resulting metrics are shipped off to a metrics dashboard.
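As a rough sketch of that idea in Python: the function below runs one check, records the outcome as a metric, and alerts on failure. The names `run_contract_test` and `ContractTestResult`, and the `alert` and `record_metric` callbacks, are hypothetical; in practice they would wrap whatever alerting and metrics tooling you already have.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ContractTestResult:
    name: str
    passed: bool
    duration_seconds: float
    error: Optional[str] = None


def run_contract_test(
    name: str,
    check: Callable[[], None],
    alert: Callable[[ContractTestResult], None],
    record_metric: Callable[[ContractTestResult], None],
) -> ContractTestResult:
    """Run one contract check, record its outcome, and alert on failure."""
    started = time.monotonic()
    try:
        # The check should raise if the external system breaks the contract.
        check()
        result = ContractTestResult(name, True, time.monotonic() - started)
    except Exception as exc:
        result = ContractTestResult(
            name, False, time.monotonic() - started, repr(exc)
        )
    # Every run is recorded, pass or fail, so the dashboard shows a trend.
    record_metric(result)
    if not result.passed:
        alert(result)
    return result
```

How often this runs is left to an external scheduler (a cron job, for instance), with the interval chosen according to the risk the external system poses.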
Implementation
In most cases, a contract test is easy to implement. For example, suppose there is a function in our system that registers a CI (configuration item) in an external Configuration Management system through an API.
Instead of creating a new function for our test, we can reuse the code and invoke it on a schedule with an arbitrary CI id. The function will try to register the “test” CI, and when it fails, the contract test can alert us.
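A minimal sketch of this reuse in Python could look as follows. Everything here is a stand-in: `register_ci` represents our existing production function, `send` represents the real API call (assumed to return an HTTP-style status code), and `TEST_CI_ID` is a made-up reserved id.

```python
from typing import Callable

# Hypothetical reserved id, so test registrations are recognizable in the CMDB.
TEST_CI_ID = "contract-test-ci"


def register_ci(ci_id: str, send: Callable[[str], int]) -> None:
    """Stand-in for the production function that registers a CI.

    `send` represents the real API call and returns an HTTP-style status code.
    """
    status = send(ci_id)
    if status >= 400:
        raise RuntimeError(f"registering CI {ci_id!r} failed with status {status}")


def contract_test_register_ci(send: Callable[[str], int]) -> bool:
    """Exercise the same register_ci code path the service uses, with a test id."""
    try:
        register_ci(TEST_CI_ID, send)
        return True
    except Exception:
        return False
```

The point of the design is that the test contains no client logic of its own. If the production code changes how it talks to the external system, the contract test changes with it.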
I recommend testing only the consumption of the external service. For example, do not try to ping the server infrastructure, because it might change. (Also, we are not responsible for monitoring the infrastructure of an external service.)
In conclusion
- For each connection with an external service, there should be a clear contract that both parties have agreed upon.
- If external systems are deemed unreliable, create a contract test by invoking the exact code that is used in the internal system.
Now that we have insights and metrics, we can decide how to deal with an unreliable external service. We could have discussions with the service provider, build resilience, or even get rid of the dependency.
What are your thoughts? Perhaps you have a different approach? Let me know in the comments.