Strategies to Handle Transient Faults in Web APIs
Web APIs commonly communicate with remote services and resources, and thus, it is inevitable that transient faults will eventually occur. Transient faults are errors expected to be temporary (partial failures), such as temporary service unavailability, network connectivity issues, etc. Transient faults are often self-corrected by retrying the action after a short delay. However, there are additional strategies that we can follow to reduce and handle transient faults to provide resilient APIs.
We should accept that transient faults will happen, especially in microservice and cloud-based applications. An API is resilient when it can recover from failures (in a way that avoids downtime or data loss) and continue functioning.
This article will teach us the main strategies to handle transient faults by example.
Let’s assume that in our use case, we need to execute an action in our Web API, which retrieves data from one or more third-party providers (e.g., via HTTP) to return to our consumers (clients), as shown in Figure 1.
We aim to design a system that can continue functioning in an environment where transient failures are inevitable. So, let’s see the basic strategies we can apply to handle transient faults. Depending on our project architecture, these strategies can be used individually or in combination.
Retrying is perhaps the most common strategy: we retry the action after a short delay. The rationale is that when the problem is transient, the action will succeed in a subsequent attempt (self-correction).
However, retrying carries risks: we may overwhelm the recipient system with requests (while it is already struggling), pushing additional systems toward failure (ripple effect) or even causing a Denial-of-Service (DoS).
So, depending on each case, we could retry with exponential backoff, meaning that the delay between retries increases as the number of retries increases (Figure 2). In addition, we could combine the retries with circuit breakers (see the following section) to stop the communication attempts for a while.
In our case study, we want to retrieve data from multiple providers (e.g., with the GET HTTP method), an operation that is Safe and Idempotent. However, when applying retries to write operations (e.g., the POST and PATCH HTTP methods), we must be sure that these actions are idempotent; otherwise, retries may create duplicate resources. An easy way to apply idempotency in your APIs is the IdempotentAPI open-source NuGet library. You can read more in these articles.
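As an illustration only (the function and parameter names are hypothetical, and the next article shows how a resilience library handles this for us), a minimal retry-with-exponential-backoff sketch could look like this:

```python
import random
import time

def retry_with_backoff(action, max_retries=3, base_delay=0.1):
    """Run `action`; on failure, wait base_delay * 2**attempt
    (plus a little random jitter) and retry, up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: let the caller handle it
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)

# Example: a hypothetical action that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "data"

print(retry_with_backoff(flaky, max_retries=3, base_delay=0.01))  # → data
```

The jitter spreads retries from many clients over time, so they do not all hit the recovering service at the same instant.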
In the circuit-breaker strategy, we perceive communication as a circuit (see Figure 3). When the circuit is closed (normal scenario), communication is allowed. On the contrary, the execution is blocked when the circuit is open.
To apply the circuit-breaker pattern, we should track the number of failed requests and open the circuit (blocking the communication) based on some criteria derived from our use case requirements. For example, we could open the circuit when the error rate or the number of consecutive failures exceeds a configured limit. It is preferable to make these limits configurable, so we can easily adjust and optimize them for our case.
When many requests fail, we assume that the service is probably unavailable or under heavy traffic. Thus, it is preferable not to make additional requests, giving that system a break. Then, after a defined time, we send a trial request, and if it succeeds, the circuit is closed, restoring the communication.
For our use case, we could use a circuit breaker for each third-party provider. Thus, we would recognize and open the circuits of the non-responding providers (Figure 4). This way, we will continue responding with the available providers’ data, avoid the ripple effect, and continue operating.
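The mechanics above can be sketched in a few lines (an illustrative toy, not a production implementation; the class and threshold names are hypothetical):

```python
import time

class CircuitBreaker:
    """Open the circuit after `failure_threshold` consecutive failures;
    after `reset_timeout` seconds, allow one trial call (half-open),
    and close the circuit again on success."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, action):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: call blocked")
            # Otherwise half-open: let this one trial call through.
        try:
            result = action()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0       # a success closes the circuit
        self.opened_at = None
        return result
```

Keeping one such breaker instance per third-party provider gives exactly the per-provider isolation described above: an unresponsive provider opens only its own circuit.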
In our use case, we need to communicate (via the network) with third-party providers to retrieve data. The question that arises is how long we should wait for their responses, and how long our consumers will wait for ours. Clearly, we cannot wait forever, and waiting for a long time may be pointless because the customer (final user) will have already closed our app (web, native, etc.).
In general, we should always use timeouts when waiting for a response, with the timeout duration defined per use case. It is especially important to set a small per-request timeout when we use retries, because in the worst case the total wait grows with every attempt (plus any backoff delays between them):
TotalTimeout ≈ NumberOfRetries * RequestTimeout
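To make the timeout budget concrete, a worst-case estimate (a hypothetical helper that also accounts for the exponential-backoff delays between attempts) can be computed as:

```python
def worst_case_wait(request_timeout, max_retries, base_delay=0.0):
    """Worst-case caller wait: the initial attempt plus `max_retries`
    retries, each bounded by `request_timeout`, plus the exponential
    backoff delays slept between attempts."""
    attempts = max_retries + 1
    backoff = sum(base_delay * (2 ** i) for i in range(max_retries))
    return attempts * request_timeout + backoff

# e.g., a 2 s per-request timeout, 3 retries, 0.5 s base backoff delay:
print(worst_case_wait(2.0, 3, 0.5))  # → 11.5 (8 s of timeouts + 3.5 s of backoff)
```

With a generous 30 s per-request timeout and a few retries, the consumer could wait minutes; this is why the per-request timeout must be kept small.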
Caching is storing data (in a local or distributed cache) so it can be served faster on future requests, rather than being retrieved from a data store or a service on each request (Figure 5). Applying caching to an API that relies heavily on data can improve its performance and reduce network traffic. Caching is not a direct strategy to handle transient faults; however, it can reduce them, and we can combine it with other strategies such as Fallbacks.
Caching systems commonly provide read-through and/or cache-aside caching.
- Read-Through Caching: The cache itself automatically retrieves the data from the data store and stores a copy when the cache is empty or the data is stale (expired). This is a good strategy for relatively static data or data that is read frequently (e.g., data for all users, such as products, categories, etc.).
- Cache-Aside: The application retrieves the data on demand from the data store (e.g., on a user’s request) when the cache is empty or the data is stale (expired), and then populates the cache. This is a good strategy for general caching purposes to reduce the traffic to the data store.
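The cache-aside flow can be sketched as follows (illustrative names; in a real API the data store would be a database or an HTTP call, and the cache would typically be a distributed cache such as Redis with a proper TTL):

```python
import time

class CacheAside:
    """Minimal cache-aside: check the cache first; on a miss or an
    expired entry, load from the data store and populate the cache."""

    def __init__(self, load_from_store, ttl_seconds=60.0):
        self.load_from_store = load_from_store  # e.g., a DB or HTTP call
        self.ttl = ttl_seconds
        self._cache = {}  # key -> (value, stored_at)

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None:
            value, stored_at = entry
            if time.monotonic() - stored_at < self.ttl:
                return value  # cache hit: the data store is not contacted
        value = self.load_from_store(key)
        self._cache[key] = (value, time.monotonic())
        return value

# Hypothetical data store that records each time it is called.
store_calls = []
cache = CacheAside(lambda key: store_calls.append(key) or f"value-of-{key}")
print(cache.get("products"))  # miss: loaded from the store
print(cache.get("products"))  # hit: served from the cache
```

Note that the second `get` never touches the store, which is exactly what reduces traffic (and the opportunity for transient faults).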
The purpose of a Web API is to provide functionality to many consumers. Unfortunately, due to a faulty implementation or for malicious purposes, consumers could perform many requests to the Web API and monopolize its resources. This can decrease the Web API’s performance and increase transient faults.
Rate-limiting can be used to control the rate of an operation execution (e.g., requests sent or received) within a rolling time window. In our example, we could assign a rate limit to each user to fairly share the resources.
When a rate limit is exceeded, Web APIs commonly return the HTTP 429 Too Many Requests response status code, indicating that the consumer has sent too many requests in the current time window. Additionally, a Retry-After header can be included in the response to indicate how long the consumer should wait before performing a new request.
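A per-consumer rolling-window limiter can be sketched like this (illustrative names; in a real Web API the `allow` result would be mapped to an HTTP 429 response carrying the Retry-After hint):

```python
import time
from collections import deque

class RollingWindowRateLimiter:
    """Allow at most `limit` requests per `window` seconds per consumer."""

    def __init__(self, limit=5, window=1.0):
        self.limit = limit
        self.window = window
        self._timestamps = {}  # consumer -> deque of recent request times

    def allow(self, consumer):
        now = time.monotonic()
        times = self._timestamps.setdefault(consumer, deque())
        while times and now - times[0] >= self.window:
            times.popleft()  # drop requests outside the rolling window
        if len(times) >= self.limit:
            retry_after = self.window - (now - times[0])
            return False, retry_after  # map to 429 + Retry-After header
        times.append(now)
        return True, 0.0
```

Because each consumer gets its own window, one misbehaving client exhausts only its own quota, leaving the API responsive for everyone else.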
In our case study, we communicate with the API of each third-party provider. When a provider service starts to fail, responds with increased delay, etc., many requests may pile up waiting to be executed, and potentially fail. These queued requests could consume our Web API server’s resources (CPU, threads, memory, etc.), leading to cascading failures.
In the Bulkhead strategy, the application’s components (e.g., the provider services) are isolated into pools to tolerate failures. Thus, if one component fails, the others continue to function. The same strategy is used in a ship’s hull, which is partitioned into sections (bulkheads). If the hull is compromised, only the damaged section fills with water, preventing the ship from sinking.
In a Bulkhead strategy, we commonly limit per component:
- The number of concurrent calls (max running) and,
- The number of queued requests (max pending).
Requests are queued when the maximum number of concurrent calls is reached. When the queue capacity is also exceeded, the Bulkhead strategy rejects new requests.
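Both limits can be enforced with two semaphores, one bounding running calls and one bounding running plus queued calls (an illustrative sketch with hypothetical names):

```python
import threading

class Bulkhead:
    """Per-component limit: at most `max_concurrent` running calls and
    at most `max_queued` waiting callers; anything beyond is rejected."""

    def __init__(self, max_concurrent=2, max_queued=2):
        self._slots = threading.Semaphore(max_concurrent)
        # Bounds running + queued callers together.
        self._capacity = threading.Semaphore(max_concurrent + max_queued)

    def run(self, action):
        if not self._capacity.acquire(blocking=False):
            raise RuntimeError("bulkhead full: request rejected")
        try:
            with self._slots:  # a queued caller blocks here until a slot frees
                return action()
        finally:
            self._capacity.release()
```

In our case study, we could dedicate one bulkhead per third-party provider, so a slow provider can exhaust only its own pool rather than the whole server’s threads.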
We have seen several strategies we can apply to reduce and handle transient faults. However, some actions will still fail. I am not a pessimist! Transient faults are inevitable 😃.
So, we need to plan what we will do in those cases. For example, on a read operation (e.g., the GET HTTP method), we could return cached data (even stale) or a default value. The fallback scenarios are highly dependent on the nature of each use case, so we have to examine our options to decide. The most common fallback is logging and returning the error (especially for write operations).
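For a read operation, the fallback chain just described (fresh data, then stale cache, then a default) might be sketched as follows (hypothetical helpers):

```python
def get_data_with_fallback(fetch, read_cache, default=None):
    """Try the primary fetch; on failure, fall back to cached data
    (even if stale), and finally to a default value."""
    try:
        return fetch()
    except Exception:
        cached = read_cache()
        if cached is not None:
            return cached  # possibly stale, but better than an error
        return default

# Hypothetical provider call that is currently failing.
def failing_fetch():
    raise ConnectionError("provider unavailable")

print(get_data_with_fallback(failing_fetch, lambda: ["cached-product"]))
# falls back to the cached value
```

In production we would also log the original exception inside the `except` branch, since the most common fallback, as noted above, is logging and returning the error.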
Transient faults are errors expected to be temporary (partial failures), such as temporary service unavailability, network connectivity issues, etc. We aim to design systems that can continue functioning in an environment where transient failures are inevitable.
We can use several strategies, individually or in combination, to reduce transient failures and handle them so we can continue operating. However, some actions will still fail, and for these, we should define what to do (a fallback).
When applying retries on write operations (e.g., POST and PATCH HTTP methods), we have to be sure that these actions are idempotent. An easy way to apply idempotency in our APIs is using the IdempotentAPI open-source library.
In the following article, we will use the Polly library to apply the strategies that we learned in this article. So, stay tuned!