Circuit Breaker Pattern

When preparing my presentation for NTK, the largest IT conference in Slovenia, titled Cloud Patterns: How to write applications for the cloud, I explored different architectural patterns that deal with the reliability of our cloud applications.

One of the most interesting patterns, in my opinion, and one that isn't often mentioned, is the Circuit Breaker pattern. Described in detail in Michael Nygard's Release It!, it basically does what its electrical sibling does. When it detects an anomaly (e.g. over-current), it opens, preventing any additional current from flowing through until it is reset, manually or via some other mechanism.

External Services

One of the key topics of my presentation was decoupling our application from external dependencies. Most of our applications use an external service in one way or another, unless the complete app, including its data, is hosted in-memory. But putting database servers aside for a moment, we also rely on other services, like SendGrid or even Facebook. These services generally save us time (and therefore money), but they have an annoying tendency to fail. Granted, it doesn't happen often, but when it does, it's out of our control.

There is no more "can fail", there is only "will fail"...

Yes, if you decouple your application, use queues, etc., a disruption in SendGrid's service won't be a big problem. Sure, some users won't get their emails right away, but thanks to a properly implemented queue mechanism in our app, they will get them a bit later. The real problem arises when the dependency is critical. For example, what if you use a payment gateway to process credit cards? A disruption there means lost revenue.

It is finally becoming common to implement Retry Policies in our code. So when a service goes down, what frequently happens is that our code will retry and retry and retry until the service is back up. Good? Yes, if it's a transient fault; absolutely not if the service is down due to, say, a DDoS attack. Looked at abstractly, what happens in these cases is a sudden surge of communication between us and the external service, all of it generating errors. Ideally, we would have a mechanism that monitors this communication and, when it detects such an anomaly, shuts it down for a certain period of time or until it is manually reset.
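To make the retry behavior concrete, here is a minimal hand-rolled sketch. NaiveRetry and its parameters are hypothetical names made up for this illustration; this is not the retry policy used later in the post, just a plain loop with a doubling delay.

```csharp
using System;
using System.Threading;

// Hypothetical helper for illustration only; not part of the sample application.
public static class NaiveRetry
{
    // Runs the action up to maxAttempts times, doubling the delay after each failure.
    // Returns the number of attempts that were made; the last failure is rethrown.
    public static int Execute(Action action, int maxAttempts, TimeSpan initialDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        var delay = initialDelay;
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return attempt;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Hopefully a transient fault: wait, then try again.
                Thread.Sleep(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2);
            }
        }
    }
}
```

Against a service that is down for good, every caller burns through all of its attempts, which is exactly the stream of failing calls described above.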

The Circuit Breaker

Enter the Circuit Breaker. It does exactly what I described above: it sits between our application and another endpoint (an external or internal service) and trips when it detects an anomaly. It can be described as a three-state state machine:

Closed: the default state, where all operations pertaining to the other service are executed normally. In this state, if an operation results in an exception/error, a failure counter is incremented. When the counter reaches a certain threshold, the circuit breaker transitions into the
Opened state, where no operation is allowed through at all. At the transition, a timer is started. When the timer elapses, or there is a manual reset, the circuit breaker transitions into the
Half-opened state. This is basically a probationary state, where the first operation is executed, and if it succeeds, the circuit breaker returns to the closed state. If it fails, the circuit breaker transitions back into the opened state.
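The three states can be sketched in a few lines. This is a simplified, single-threaded illustration and not the class from the sample application: SimpleCircuitBreaker, BreakerState, BreakerOpenException and the injectable clock are all names I am assuming here, and the timer is replaced by a timestamp check on the next call.

```csharp
using System;

public enum BreakerState { Closed, Open, HalfOpen }

// Thrown when a call is rejected because the breaker is open (hypothetical name).
public class BreakerOpenException : Exception
{
    public BreakerOpenException() : base("Circuit breaker is open.") { }
}

// Minimal, single-threaded sketch of the state machine; not production code.
public class SimpleCircuitBreaker
{
    private readonly int _threshold;
    private readonly TimeSpan _timeout;
    private readonly Func<DateTime> _clock; // injectable so the timeout can be tested
    private int _failureCount;
    private DateTime _openedAt;

    public BreakerState State { get; private set; } = BreakerState.Closed;

    public SimpleCircuitBreaker(int threshold, TimeSpan timeout, Func<DateTime> clock = null)
    {
        _threshold = threshold;
        _timeout = timeout;
        _clock = clock ?? (() => DateTime.UtcNow);
    }

    public void Execute(Action action)
    {
        // Open -> HalfOpen once the timeout has elapsed.
        if (State == BreakerState.Open && _clock() - _openedAt >= _timeout)
            State = BreakerState.HalfOpen;

        // Still open: fail fast, the other service is not called at all.
        if (State == BreakerState.Open)
            throw new BreakerOpenException();

        try
        {
            action();
        }
        catch (Exception)
        {
            _failureCount++;
            // A failed probe, or too many failures while closed, opens the breaker.
            if (State == BreakerState.HalfOpen || _failureCount >= _threshold)
                Trip();
            throw;
        }

        // Success: a half-open probe closes the breaker, and the counter starts over.
        State = BreakerState.Closed;
        _failureCount = 0;
    }

    private void Trip()
    {
        State = BreakerState.Open;
        _openedAt = _clock();
    }
}
```

Passing the clock in as a delegate is only there to make the timeout testable without waiting; a real implementation might use a timer instead and would need to be thread-safe.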

I hope you are starting to see the potential uses already - you don't have to flood the other server if it's down. But there are more things you can do. For example, you can query the circuit breaker for its state. And with that information, you can design something that is even more valuable to your users.

Switchable Repositories

In my talk, I proposed building a switchable repository. If you follow Martin Fowler, you'd be inclined to call it a switchable gateway. This applies, of course, if you are using an external service (like SendGrid). However, to be fair, I actually use the Services - Repository architecture most of the time, because technically, to my code, the external service is nothing more than the container for "some" data.

To really get the most out of this pattern, you should be using Dependency Injection (side note: I gave a talk about this a long, long time ago and will probably blog about it soon). What you do in this case is implement a switchable repository and inject both repositories, each accessing a different third-party service, into it. You then tell the DI container that the switchable repository, with its circuit breaker, is the default implementation of that repository. After that, it's basically just a glorified proxy.

The code for it is here.

You can download the entire sample application from this repository.

Using it

I envisioned using it with a DI framework, where you can bind an implementation of something like IDummyRepository to something akin to

new SwitchableRepository<string>(new DummyRepositoryA(), new DummyRepositoryB());

It depends on how you have implemented the repositories, and whether they are generic enough for this to work, but this should give you a proper head start.
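As a rough sketch of that wiring, the following stand-in replaces the DI container with a plain factory method and the circuit breaker with a simple failure counter. The class names mirror the ones used in this post, but the bodies, the IsDown flag, the threshold parameter and CompositionRoot are assumptions made up for the illustration, not the code from the sample application.

```csharp
using System;

public interface IRepository<TModel>
{
    TModel GetModel();
}

// Pretend primary service; IsDown simulates an outage (hypothetical test hook).
public class DummyRepositoryA : IRepository<string>
{
    public bool IsDown { get; set; }

    public string GetModel()
    {
        if (IsDown) throw new InvalidOperationException("Service A is down.");
        return "from A";
    }
}

// Pretend backup service.
public class DummyRepositoryB : IRepository<string>
{
    public string GetModel() { return "from B"; }
}

// Simplified stand-in: switches to the backup after `threshold` consecutive
// primary failures. No timeout and no half-open state, to keep it short.
public class SwitchableRepository<TModel> : IRepository<TModel>
{
    private readonly IRepository<TModel> _primary;
    private readonly IRepository<TModel> _backup;
    private readonly int _threshold;
    private int _failures;

    public SwitchableRepository(IRepository<TModel> primary, IRepository<TModel> backup, int threshold = 3)
    {
        _primary = primary;
        _backup = backup;
        _threshold = threshold;
    }

    public TModel GetModel()
    {
        if (_failures >= _threshold) return _backup.GetModel();

        try
        {
            var result = _primary.GetModel();
            _failures = 0;
            return result;
        }
        catch (Exception)
        {
            _failures++;
            throw;
        }
    }
}

public static class CompositionRoot
{
    // Stands in for a container binding IRepository<string> to the switchable repository.
    public static IRepository<string> CreateRepository(IRepository<string> primary, IRepository<string> backup)
    {
        return new SwitchableRepository<string>(primary, backup, threshold: 2);
    }
}
```

Consumers only ever ask for IRepository<string>, so they never learn which of the two services answered the call.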

Understanding it

Let's take a deeper look at how the repository actually works in conjunction with the Circuit Breaker. All operations follow the same principle, but let's focus on the Get operation:

public TModel GetModel()
{
    var result = default(TModel);
    RetriableWrapper(() =>
    {
        result = CircuitBreaker.Execute(() => Repository.GetModel());
    });
    return result;
}

The trick here is the call to the RetriableWrapper method:

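private void RetriableWrapper(Action action)
{
    // Create a policy instance for this call.
    var requestRetryPolicy = _retryPolicy.CreateInstance();
    var retryCount = 0;
    TimeSpan retryInterval = TimeSpan.FromSeconds(1); // delay before the first retry
    var opContext = new OperationContext();
    Exception previousException = null;
    do
    {
        try
        {
            Debug.WriteLine("Executing action, count: {0}", retryCount);
            action();
            return; // success, no retry needed
        }
        catch (Exception e)
        {
            retryCount++;
            previousException = e;
            // Wait before asking the policy whether another attempt is allowed.
            Thread.Sleep(retryInterval);
        }
        // The policy decides whether to try again and sets the next interval.
    } while (requestRetryPolicy.ShouldRetry(retryCount, 0, previousException, out retryInterval, opContext));
    // Note: once the policy gives up, the last exception is not rethrown; the method simply returns.
}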

This method takes care of the grunt work. It creates a retry policy (note: in production code, this should be handled elsewhere) and keeps calling the action passed in as the parameter for as long as the retry policy permits it. The circuit breaker's execution logic, shown here as its InternalExecute method, is where the cool stuff happens:

private void InternalExecute(Action action)
{
    // Fail fast: while the breaker is open, the action is not executed at all.
    if (this.State == CircuitBreakerState.Open)
    {
        throw new OpenCircuitException("Circuit Breaker is open.");
    }

    try
    {
        //Debug.WriteLine("Executing CB action.");
        action();
        //Debug.WriteLine("CB success.");
    }
    catch (Exception e)
    {
        if (e.InnerException == null)
        {
            Debug.WriteLine("Inner Exception null exception occurred.");
            // called by the target of the invocation, rethrow
            // TODO: remove this, because we are testing with notimplemented
            // throw;
        }

        // TODO: we should check if the exception is blacklisted
        if (this.State == CircuitBreakerState.HalfOpened)
        {
            // trip immediately
            Trip();
        }
        else if (this._failureCount < _threshold)
        {
            Interlocked.Increment(ref this._failureCount);

            // we could raise a Service Level Changed event here, if we measured it :)
        }
        else if (this._failureCount >= this._threshold)
        {
            Trip();
        }

        // Fall back to the exception itself when there is no inner exception,
        // so the original cause is never lost.
        throw new OperationFailedException("Operation failed", e.InnerException ?? e);
    }

    // The probe call succeeded, so the breaker can close again.
    if (this.State == CircuitBreakerState.HalfOpened)
    {
        Reset();
    }

    if (this._failureCount > 0)
    {
        // we should only decrement, if measuring SLA
        _failureCount = 0;
    }
}

The last piece of the puzzle is the property "Repository":

private IRepository<TModel> Repository
{
    get
    {
        return _circuitBreaker.Open ? _backupRepository : _underlyingRepository;
    }
}

Basically, what I do is call an internal helper with a delegate that calls the method (e.g. Get, Add, Remove, whatever...) on the Repository property, which returns the proper repository based on the state of the circuit breaker. Each call, however, is made through a circuit breaker. Note that each repository has its own circuit breaker, so there's a matching property, called CircuitBreaker (I'm sorry about the unoriginal names):

private CircuitBreaker CircuitBreaker
{
    get { return _circuitBreaker.Open ? _backupCircuitBreaker : _circuitBreaker; }
}

The circuit breaker then executes the call, catches exceptions, and trips if the number of failures is too high. In that case, it gives the service some time (defined by the TimeSpan timeout) before switching to the half-open state, where it behaves (to the outside) as if it were closed again: if the call works at this point, it resets; if not, it trips back to open.

Wrapping it up

I hope this approach will be useful to someone out there. I use a derivative of this in my work, and I find it very helpful. The primary application is, as I wrote at the beginning of this post, switching to another external service if one goes offline. The code on GitHub is open: use it if it helps, and let me know if you'd change something (or contribute!).