.NET Nakama

Improving your .NET skills

Resilience APIs to Transient Faults using Polly

August 04, 2022 (~18 Minute Read)
WEB API TRANSIENT FAULTS CIRCUIT-BREAKER RETRIES POLLY RESILIENCY

Introduction

In our previous article (.NET Nakama, 2022 July), we saw that transient faults are inevitable, temporary errors (especially in microservice and cloud-based applications). An API is resilient when it can recover from transient failures and continue functioning (in a way that avoids downtime or data loss). We also learned the main strategies for handling transient faults. Transient fault handling may seem complicated, but libraries like Polly can simplify it.

In this article, we will use the Polly library to apply and combine the Retries, Circuit-Breaker, Network Timeout, and Fallbacks strategies. In our use case, we need to execute an action in our Web API (GetData), which retrieves and merges data from two third-party providers (via HTTP) to return them to our consumers (clients), as shown in Figure 1.

Figure 1. - Our Web API use case example with two providers (third-party).

The Polly library and Resilience Policies

Polly is a .NET resilience and transient-fault-handling library. Simply put, it provides a quick way to define and apply strategies (policies) to handle transient faults. Polly targets .NET Standard 1.1 and .NET Standard 2.0+, which means that it can be used almost everywhere, including:

  • .NET Framework 4.6.1 and 4.7.2
  • .NET Core 2.0+, .NET Core 3.0, and later.

The Polly NuGet library can be installed via the NuGet UI or the NuGet package manager console:

PM> Install-Package Polly

Each strategy that we learned in .NET Nakama (2022, July), and that we will use in this article, has an equivalent Polly policy, as we can see in the following table.

Table 1. - The Polly Policies-Strategies to Handle Transient Faults

Strategy | Polly Policy | Description
Retries | Retry Policy (with and without waiting) | Let’s retry after a short delay. Then, maybe the fault will self-correct.
Circuit-Breaker | Circuit Breaker Policy | When a system is seriously struggling, failing fast is better to give that system a break.
Network Timeout | Timeout Policy | Don’t wait forever! Beyond a specific waiting time, a successful result is unlikely and worthless.
Fallbacks | Fallback Policy | Things will still fail! Plan what you will do when that happens.
Combination of Multiple Strategies | Policy Wrap | Different faults require different strategies. By combining multiple policies, we increase the resiliency.

Using Polly in 3 Steps

Step 1: Specify the Faults That the Policies Will Handle

We need to apply some policies (strategies) to handle transient faults. So, our first step is to define how to recognize these faults. A fault can be recognized either from:

  • Exceptions thrown (such as HttpRequestException, Exception, etc.), or
  • Returned Results (specifying the fault, e.g., in a related property such as Status, ErrorCode, etc.). In this case, we assume that the exceptions are handled with a try-catch, and a corresponding result will be returned.

Handle Thrown Exceptions

In the following example code, we will handle the HttpRequestException and OperationCanceledException exceptions.

PolicyBuilder policyBuilder = Policy
  .Handle<HttpRequestException>()
  .Or<OperationCanceledException>();

In the following example code, we will handle all exceptions. In addition, we state that our execution code and our fallback policies will return a nullable MyCodesResponse object. In this way, we can define policies for execution code that is not void (i.e., it returns something).

PolicyBuilder<MyCodesResponse?> policyBuilder = Policy<MyCodesResponse?>
  .Handle<Exception>();

Handle Returned Results

In the following example code, we will get a MyResponseDTOClass object and handle the cases in which the MyStatusCode property is either InternalServerError or BadGateway.

var policyBuilder = Policy
  .HandleResult<MyResponseDTOClass>(r => r.MyStatusCode == StatusCode.InternalServerError)
  .OrResult(r => r.MyStatusCode == StatusCode.BadGateway);
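
Exceptions and returned results can also be handled by the same builder. The following is a minimal sketch, assuming the Polly 7.x API used in this article and a provider that answers with an HttpResponseMessage. The list of status codes treated as transient and the single retry are assumptions made only for illustration.

```csharp
using System;
using System.Net;
using System.Net.Http;
using Polly;

// Status codes that we (as an assumption) treat as transient.
HttpStatusCode[] transientStatusCodes =
{
    HttpStatusCode.InternalServerError,
    HttpStatusCode.BadGateway
};

// Handle thrown exceptions AND returned results in the same builder.
PolicyBuilder<HttpResponseMessage> policyBuilder = Policy
    .Handle<HttpRequestException>()
    .OrResult<HttpResponseMessage>(r => Array.IndexOf(transientStatusCodes, r.StatusCode) >= 0);

// Retry once and count how often the policy decided to retry.
int retries = 0;
var retryPolicy = policyBuilder.Retry(1, (outcome, retryCount) => retries++);

// A BadGateway result is handled (one retry); an OK result is not.
var badGateway = retryPolicy.Execute(() => new HttpResponseMessage(HttpStatusCode.BadGateway));
var ok = retryPolicy.Execute(() => new HttpResponseMessage(HttpStatusCode.OK));

Console.WriteLine($"Retries: {retries} | Results: {badGateway.StatusCode}, {ok.StatusCode}");
```

When the retries are exhausted for a handled result, the policy returns the last result to the caller instead of throwing.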

Step 2: Specify How the Policy Should Handle the Faults

In this step, we define the policies (strategies) with their thresholds and how we will combine them. The following code samples show how we can define policies based on the policyBuilder of Step 1. However, there are cases, such as the TimeoutPolicy, in which we should use the Polly static methods. To learn more about the different overloads of each policy, see the Polly documentation.

// Retry three times.
RetryPolicy retryPolicy = policyBuilder
  .Retry(3);

// Retry five times and use a function to calculate the duration to wait between retries
// based on the current retry attempt (allows for exponential back-off).
RetryPolicy waitAndRetryPolicy = policyBuilder
  .WaitAndRetry(5, retryAttempt =>
    TimeSpan.FromSeconds(Math.Pow(2, retryAttempt)));

// Break the circuit after 2 consecutive exceptions, and keep the circuit broken for 1 minute.
CircuitBreakerPolicy breakerPolicy = policyBuilder
  .CircuitBreaker(2, TimeSpan.FromMinutes(1));

// Time out and return to the caller after 5 seconds.
TimeoutPolicy timeoutPolicy = Policy.Timeout(5);

Step 3: Execute Code through the Policy

It’s time to apply the policy in the code that communicates with the third-party provider. In the following examples, we use the circuit-breaker policy to execute the ThirdPartyProviderCommunication() function. In addition, we can see how to get a response.

// Execute an action.
breakerPolicy.Execute(() =>
{
  ThirdPartyProviderCommunication();
});

// Execute a function and get the response.
MyCodesResponse? myResponse = breakerPolicy.Execute(() =>
{
  return ThirdPartyProviderCommunication();
});
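
To see the three steps working together, here is a minimal, self-contained sketch (assuming the Polly 7.x API). The local ThirdPartyProviderCommunication function is a hypothetical stand-in that fails twice before it succeeds, and the string response replaces the MyCodesResponse class only to keep the example short.

```csharp
using System;
using System.Net.Http;
using Polly;
using Polly.Retry;

// Step 1: specify the faults that the policy will handle.
PolicyBuilder<string> policyBuilder = Policy<string>
    .Handle<HttpRequestException>();

// Step 2: specify how to handle them (up to two retries, without waiting).
RetryPolicy<string> retryPolicy = policyBuilder.Retry(2);

// Hypothetical stand-in for the third-party call: fails twice, then succeeds.
int attempts = 0;
string ThirdPartyProviderCommunication()
{
    attempts++;
    if (attempts < 3)
    {
        throw new HttpRequestException($"Simulated transient fault (attempt {attempts}).");
    }

    return "forecast data";
}

// Step 3: execute the code through the policy.
string response = retryPolicy.Execute(() => ThirdPartyProviderCommunication());

Console.WriteLine($"Response: '{response}' after {attempts} attempts.");
```

The caller only sees the final, successful response. The two simulated faults are absorbed by the retry policy.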

Handle Transient Faults with Polly Policies

Policy Objects VS HttpClient Factory

To use Polly, we have two options. We can use Policy Objects or the HttpClient factory (ASP.NET Core 2.1 onwards). As we can understand from the naming:

  • Policy Objects: Can be used everywhere we want to apply the Polly policies.
  • HttpClient Factory: Add Polly policies directly on the HttpClient Factory to be applied to every outgoing call.

The HttpClient factory (IHttpClientFactory) in ASP.NET Core can be registered and used to pre-configure and create HttpClient instances in an application. The IHttpClientFactory offers additional benefits over using the HttpClient directly. If you are interested, Larkin K., et al. (2022, June 29) describe how we can use the IHttpClientFactory (basic usage, named clients, etc.).

We aim to apply the Polly policies in the communication code with our third-party providers. In our case, the communication is performed via HTTP. So, both options are applicable. However, this will not always be the case. For example, we may have to use different communication protocols or existing HTTP client libraries that do not support Polly by default. For such cases, we can select the Policy Objects option.

In this article, we will start from the basics and use the Policy Objects to apply Polly policies, which we can use everywhere.
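
For comparison, the HttpClient factory option could look like the following sketch. It is not part of the tutorial project: it assumes the Microsoft.Extensions.Http.Polly NuGet package, and the client name "provider2" and the base address are hypothetical.

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly;

var services = new ServiceCollection();

// "provider2" and the base address are hypothetical names for this sketch.
services
    .AddHttpClient("provider2", client => client.BaseAddress = new Uri("https://localhost:5002/"))
    // Handles HttpRequestException, HTTP 5xx, and HTTP 408 responses.
    .AddTransientHttpErrorPolicy(builder =>
        builder.WaitAndRetryAsync(2, retryAttempt => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))));

// Every client created with this name sends its requests through the policy.
using var provider = services.BuildServiceProvider();
HttpClient httpClient = provider.GetRequiredService<IHttpClientFactory>().CreateClient("provider2");

Console.WriteLine($"Configured client for {httpClient.BaseAddress}");
```

In an ASP.NET Core application, the same registration would normally be added to the application's service collection at startup instead of a stand-alone ServiceCollection.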

The Tutorial Project

For the sake of our use case, we implemented two dummy providers (ProviderExampleApi1 & ProviderExampleApi2), which return random weather forecasts. In addition, we can define their execution delay and error response in each API request to simulate the transient faults.

The WebApiPolly project represents our API which provides endpoints to simulate the different transient-fault scenarios. These endpoints communicate with our two providers to retrieve and combine the available results. For that purpose, two separate services have been implemented (assuming that each provider has a different API contract). In our example, we have applied Polly policies only to the integration of Provider2 (Provider2Integration.cs).

In the following sections we will see in detail:

  • How we defined each policy (as async).
  • How we combined them, and
  • How we simulated transient-fault scenarios to investigate our system’s behavior.

Basic Structure

In the Provider2Integration.cs file, we can see how we implemented the HTTP communication and applied the Polly policies. Let’s see some important details here:

  • We have registered the IProvider2Integration with Transient lifetime in the Dependency Injection (DI). You can decide how to register your services depending on your project requirements. For a better understanding of Dependency Injection and Lifetime, read the .NET Nakama (2020, November) article.
  • The HttpClient is static. The HttpClient is intended to be instantiated once per application rather than per use (.NET 6.0 Documentation).
  • Our policy object (AsyncPolicyWrap) is static. This object holds the state that the policies need across requests. For example, the circuit breaker has to count the consecutive errors. Therefore, we cannot instantiate it per request.

Handle All Exceptions

We will handle all Exceptions and define that our execution code and fallback policies will return a nullable Provider2GetResponse class.

var policyBuilder = Policy<Provider2GetResponse?>
  .Handle<Exception>();

Fallbacks Policy

We intend to use several policies to reduce and handle transient faults. However, there will be actions that will still fail. Using a fallback policy, we plan what we will do in those cases. In the following example, we will log (in console) these cases and return a null value. We could return a default or substitute value depending on each use case.

// Fallback policy:
// Provide a default or substitute value if an execution faults.
var fallbackPolicy = policyBuilder
  .FallbackAsync((cancellationToken) =>
  {
    // In our case we return a null response.
    Console.WriteLine($"{DateTime.Now:u} - Fallback null value is returned.");
 
    return Task.FromResult<Provider2GetResponse?>(null);
  });

Retry Policy with Exponential Backoff

In this policy, we will retry the failed executions up to maxRetries times (e.g., 2) and wait between the retries for a duration calculated from the current retry attempt. So, if we set the max retries to two, the maximum number of executions would be three (initial execution + two retries). In this example, we are using a simple function (waitTime = 2 ^ retryAttempt) to calculate the waitTime:

  • 2 ^ 1 = 2 seconds
  • 2 ^ 2 = 4 seconds
  • 2 ^ 3 = 8 seconds
  • etc.

// Wait and Retry policy:
// Retry with exponential backoff, meaning that the delay between
// retries increases as the number of retries increases.
var retryPolicy = policyBuilder
  .WaitAndRetryAsync(maxRetries, retryAttempt =>
  {
    var waitTime = TimeSpan.FromSeconds(Math.Pow(2, retryAttempt));
    Console.WriteLine($"{DateTime.Now:u} - RetryPolicy | Retry Attempt: {retryAttempt} | WaitSeconds: {waitTime.TotalSeconds}");

    return waitTime;
  });
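
To estimate how long a caller may wait in the worst case, we can compute the schedule produced by this formula. The following sketch does not use Polly. It only repeats the waitTime calculation for an example value of maxRetries = 2.

```csharp
using System;

// Same formula as in the retry policy: waitTime = 2 ^ retryAttempt seconds.
TimeSpan WaitTime(int retryAttempt) => TimeSpan.FromSeconds(Math.Pow(2, retryAttempt));

int maxRetries = 2; // Example value.
TimeSpan totalWait = TimeSpan.Zero;

for (int retryAttempt = 1; retryAttempt <= maxRetries; retryAttempt++)
{
    TimeSpan waitTime = WaitTime(retryAttempt);
    totalWait += waitTime;
    Console.WriteLine($"Retry {retryAttempt}: wait {waitTime.TotalSeconds} seconds");
}

// Worst case: one initial execution plus maxRetries retries,
// plus the accumulated waiting time between them.
Console.WriteLine($"Executions: {maxRetries + 1} | Total wait: {totalWait.TotalSeconds} seconds");
```

With two retries, the waits add up to 2 + 4 = 6 seconds, on top of the duration of the three executions themselves.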

Circuit-Breaker Policy

In this circuit-breaker policy, we break the circuit after breakCurcuitAfterErrors consecutive exceptions and keep the circuit broken for keepCurcuitBreakForMinutes minutes. In addition, we are defining what to do when the circuit state changes to open (onBreak) and when the circuit state changes to closed (onReset). In our case, we are keeping an informational console log.

It is essential to notice that we have used an additional fallback policy for the circuit-breaker to handle the BrokenCircuitException, keep a related log, and return an alternative response. We needed this because we would like to stop the retry policy when the circuit is open (blocked). The retry policy handles exceptions, so by turning the BrokenCircuitException into a returned value, no further retries are triggered.

// Break the circuit after breakCurcuitAfterErrors (e.g. 6) consecutive exceptions
// and keep the circuit broken for keepCurcuitBreakForMinutes (e.g. 1) minutes.
var breakerPolicy = policyBuilder
    .CircuitBreakerAsync(breakCurcuitAfterErrors, TimeSpan.FromMinutes(keepCurcuitBreakForMinutes),
    onBreak: (outcome, timespan, context) =>
    {
      // OnBreak, i.e. when the circuit state changes to open.
      Console.WriteLine($"{DateTime.Now:u} - BreakerPolicy | State changed to Open (blocked).");
    },
    onReset: (context) =>
    {
      // OnReset, i.e. when the circuit state changes to closed.
      Console.WriteLine($"{DateTime.Now:u} - BreakerPolicy | State changed to Closed (normal).");
    });

// Optional: Handle the "BrokenCircuitException" to keep a related log and/or return an alternative response.
var fallbackPolicForCircuitBreaker = Policy<Provider2GetResponse?>
  .Handle<BrokenCircuitException>()
  .FallbackAsync((cancellationToken) =>
  {
    // In our case we return a null response.
    Console.WriteLine($"{DateTime.Now:u} - The Circuit is Open (blocked) for this Provider. A fallback null value is returned. Try again later.");

    return Task.FromResult<Provider2GetResponse?>(null);
  });
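
The behavior of the circuit breaker can be observed in isolation with the following sketch (assuming the Polly 7.x API). It uses the synchronous CircuitBreaker variant, an example threshold of two consecutive exceptions, and a hypothetical CallProvider function that always fails.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

// Break after 2 consecutive exceptions and stay open for 1 minute (example values).
CircuitBreakerPolicy breakerPolicy = Policy
    .Handle<InvalidOperationException>()
    .CircuitBreaker(2, TimeSpan.FromMinutes(1));

// Hypothetical provider call that always fails.
int providerCalls = 0;
void CallProvider()
{
    providerCalls++;
    throw new InvalidOperationException("Simulated provider failure.");
}

bool rejectedByBreaker = false;
for (int request = 1; request <= 3; request++)
{
    try
    {
        breakerPolicy.Execute(() => CallProvider());
    }
    catch (BrokenCircuitException)
    {
        // The circuit is open: the provider was not called at all.
        rejectedByBreaker = true;
    }
    catch (InvalidOperationException)
    {
        // The circuit was closed: the provider was called and failed.
    }
}

Console.WriteLine($"Provider calls: {providerCalls} | State: {breakerPolicy.CircuitState} | Rejected: {rejectedByBreaker}");
```

The third request never reaches the provider because the failure count is stored in the policy instance. This is also why the policy object has to be shared between requests. A new instance per request would start counting from zero every time.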

Network Timeout Policy

In our example, we are using an HttpClient in which we can set the Timeout. However, this will not always be the case. We may communicate using a client that does not support a timeout. In such cases, we can use the Polly timeout policy. In our example, we will time out after timeoutInSeconds and write a related log. The TimeoutStrategy has the following two options:

  • Optimistic: The called code honors the CancellationToken and cancels when needed.
  • Pessimistic: The called code may not honor the CancellationToken, so Polly returns to the caller when the timeout elapses, even if the called code is still running.

private static readonly HttpClient _httpClient = new()
{
  // Network HTTP timeout in 2 seconds:
  Timeout = TimeSpan.FromSeconds(2)
};

// Timeout Policy:
var timeoutPolicy = Policy
  .TimeoutAsync<Provider2GetResponse?>(timeoutInSeconds, TimeoutStrategy.Pessimistic,
  onTimeoutAsync: (context, timespan, _, _) =>
  {
    Console.WriteLine($"{DateTime.Now:u} - TimeoutPolicy | Execution timed out after {timespan.TotalSeconds} seconds.");
    return Task.CompletedTask;
  });
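
For comparison, the Optimistic strategy could be used as in the following sketch (assuming the Polly 7.x API). The 200 ms timeout and the simulated 5-second call are example values. The important detail is that the executed code receives the CancellationToken from Polly and passes it on.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Polly;
using Polly.Timeout;

// Time out after 200 ms (example value) using co-operative cancellation.
AsyncTimeoutPolicy timeoutPolicy = Policy
    .TimeoutAsync(TimeSpan.FromMilliseconds(200), TimeoutStrategy.Optimistic);

bool timedOut = false;
try
{
    // The delegate receives Polly's CancellationToken and passes it on,
    // so the simulated 5-second call is cancelled when the timeout fires.
    await timeoutPolicy.ExecuteAsync(
        cancellationToken => Task.Delay(TimeSpan.FromSeconds(5), cancellationToken),
        CancellationToken.None);
}
catch (TimeoutRejectedException)
{
    timedOut = true;
}

Console.WriteLine($"Timed out: {timedOut}");
```

In both strategies, the caller is informed about the timeout with a TimeoutRejectedException, which the outer policies (e.g., retry or fallback) can then handle.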

Policy Wrap

The Polly policies can be combined in any order using a PolicyWrap. However, we should consider the ordering points that are described in the Polly documentation. In the following example, we combined all the policies that we have seen, based on the typical policy ordering. The first policy passed to WrapAsync is the outermost one, and the last is the innermost (closest to the executed code).

// Define the combined policy strategy:
return Policy.WrapAsync(
  fallbackPolicy,
  retryPolicy,
  fallbackPolicForCircuitBreaker,
  breakerPolicy,
  timeoutPolicy);
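
The following sketch shows how a combined policy is executed, reduced to two policies to keep it short (assuming the Polly 7.x API). The CallProviderAsync function is a hypothetical stand-in that always fails, the string result replaces the Provider2GetResponse class, and the 10 ms wait is used only to keep the example fast.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Polly;

var policyBuilder = Policy<string>.Handle<HttpRequestException>();

// Outer policy: return a substitute value when everything else has failed.
var fallbackPolicy = policyBuilder.FallbackAsync("substitute value");

// Inner policy: retry twice, waiting 10 ms between the attempts.
var retryPolicy = policyBuilder.WaitAndRetryAsync(2, retryAttempt => TimeSpan.FromMilliseconds(10));

// The first argument is the outermost policy, the last one is the innermost.
var combinedPolicy = Policy.WrapAsync(fallbackPolicy, retryPolicy);

// Hypothetical provider call that always fails.
int attempts = 0;
Task<string> CallProviderAsync()
{
    attempts++;
    return Task.FromException<string>(new HttpRequestException("Simulated provider failure."));
}

string result = await combinedPolicy.ExecuteAsync(() => CallProviderAsync());

Console.WriteLine($"Result: '{result}' after {attempts} attempts.");
```

The retry policy runs first because it is closer to the executed code. Only when its retries are exhausted does the exception reach the fallback policy, which replaces it with the substitute value.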

Transient-Fault Scenarios Simulation

The tutorial project is configured as “Multiple Startup Projects” to start the two example providers and our main Web API project together. So, you just need to click the Start button as shown in Figure 2.

Figure 2. - Start the tutorial projects.

The following table shows the endpoints that simulate the different transient-fault scenarios and their names in the provided Postman collection. The complete code of the tutorial is available on GitHub, together with the Postman collection to test it quickly.

Postman Request Name | API GET Endpoint
Happy Path Scenario: No errors | https://localhost:7083/weatherforecasts
Continuous-Failures (Provider 2 is down) | https://localhost:7083/weatherforecasts/continuous-failures
Timeout-Errors (Provider 2 delay to respond) | https://localhost:7083/weatherforecasts/timeout-errors
Transient-Faults (Random errors or/and delays on Provider 2) | https://localhost:7083/weatherforecasts/transient-faults

Continuous Exceptions and Timeouts Scenarios

To test the retry and fallback policies, we can send the Continuous-Failures and the Timeout-Errors requests and investigate the produced console logs. For example, in the following figures, we can see:

  • All executions fail either by general exception or timeout error (Figures 3 & 4).
  • The initial execution and the two retries (Figures 3 & 4).
  • The fallback policy returns a null value when the communication is not possible (Figures 3 & 4).
  • The circuit-breaker policy opened (blocked) the circuit on the 6th consecutive failed execution (Figure 5).
  • The circuit remained open for one minute (as configured) and did not accept messages to give that system a break (Figure 5).
  • After one minute, one execution was attempted, and because a failure occurred, it opened the circuit again for another minute (Figure 5).

Figure 3. - Execution logs of the Continuous-Failures endpoint.

Figure 4. - Execution logs of the Timeout-Errors endpoint.

Figure 5. - Execution logs for Circuit-Breaker policy.

Transient-Fault Scenarios

The Transient-Faults endpoint produces random errors or/and delays. In the following figure, we can see an execution example in which the first two executions failed (due to an error and a timeout). However, the third attempt was successful. Thus, in this request, the client received the results.

Figure 6. - Execution logs of a random transient-fault scenario.

Summary

Transient fault handling may seem complicated, but libraries like Polly can simplify it. In this article, we learned the three basic steps to use the Polly library. In addition, we applied and combined the Retries, Circuit-Breaker, Network Timeout, and Fallbacks policies to improve the resiliency of our Web API.

Using the provided source code and Postman collection, we simulated continuous and random failures (exceptions or/and timeouts). Finally, we investigated our system’s behavior by applying the Polly policies. As we saw, combining these policies provides a powerful tool that reduces and handles transient faults to provide resilient APIs.
