Building a Critter Stack Application: Resiliency

Hey, did you know that JasperFx Software is ready to offer formal support plans for Marten and Wolverine? Not only are we trying to make the “Critter Stack” tools viable long-term options for your shop, we’re also interested in hearing your opinions about the tools and how they should change. We’re also certainly open to helping you succeed with your software development projects on a consulting basis, whether you’re using any part of the Critter Stack or any other .NET server side tooling.

Let’s build a small web service application using the whole “Critter Stack” and their friends, one small step at a time. For right now, the “finished” code is at CritterStackHelpDesk on GitHub.

The posts in this series are:

  1. Event Storming
  2. Marten as Event Store
  3. Marten Projections
  4. Integrating Marten into Our Application
  5. Wolverine as Mediator
  6. Web Service Query Endpoints with Marten
  7. Dealing with Concurrency
  8. Wolverine’s Aggregate Handler Workflow FTW!
  9. Command Line Diagnostics with Oakton
  10. Integration Testing Harness
  11. Marten as Document Database
  12. Asynchronous Processing with Wolverine
  13. Durable Outbox Messaging and Why You Care!
  14. Wolverine HTTP Endpoints
  15. Easy Unit Testing with Pure Functions
  16. Vertical Slice Architecture
  17. Messaging with Rabbit MQ
  18. The “Stateful Resource” Model
  19. Resiliency (this post)

Sometimes, things go wrong in production. For any number of reasons. But all the same, we want to:

  • Protect the integrity of our system state
  • Not lose any ongoing work
  • Try not to require manual interventions to put things right in the system
  • Keep the system from going down even when something is overloaded

Fortunately, Wolverine comes with quite a few facilities for adding adaptive and selective resiliency to our systems — especially when doing asynchronous processing.

First off, we’re using Marten in our incident-tracking help desk system to read and persist data to a PostgreSQL database. When handling messages, Wolverine could easily encounter transient (read: random and not necessarily systematic) exceptions from network hiccups or timeouts if the database happens to be too busy at that very moment. Let’s tell Wolverine to apply a little exponential backoff (close enough for government work) and retry a command that hits one of these transient database errors a limited number of times, like this, inside the call to UseWolverine() in our Program file:

    // Let's build in some durability for transient errors
    opts.OnException<NpgsqlException>().Or<MartenCommandException>()
        .RetryWithCooldown(50.Milliseconds(), 100.Milliseconds(), 250.Milliseconds());

A retry may happily catch the system at a later time when it’s not as busy, so the transient error doesn’t recur and the message can succeed. With successive failures, we wait longer between retries. This retry policy effectively throttles a Wolverine system and may give a distressed subsystem within your architecture (in this case the PostgreSQL database) a chance to recover.

Other times a handler may encounter an exception that tells us the message in question is invalid somehow and could never be handled. There’s absolutely no reason to retry that message, so let’s tell Wolverine to discard it immediately (and not even bother to move it to a dead letter queue):

    // Log the bad message sure, but otherwise throw away this message because
    // it can never be processed
    opts.OnException<InvalidInputThatCouldNeverBeProcessedException>()
        .Discard();
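
On the other hand, if retrying won’t help but you’d still like a human to be able to inspect the bad message later, Wolverine can park it in its dead letter queue instead of throwing it away. The exception type below is hypothetical, just to show the shape of that policy:

    // Don't retry this hypothetical exception type, but keep the message
    // around in the dead letter queue for later inspection rather than
    // discarding it outright
    opts.OnException<SomeSuspiciousMessageException>()
        .MoveToErrorQueue();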

I’ve done a few integration projects now where some kind of downstream web service was prone to being completely down. Let’s pretend that we’re only calling that web service through a message handler (my preference whenever possible for exactly this failure scenario), and that we can tell from an exception that the web service is absolutely unavailable and no other messages could possibly go through until that service is fixed. I’ll sketch out what such a handler might look like right after the policy below.

Wolverine can do that as well, like so:

    // Shut down the listener for whatever queue experienced this exception
    // for 5 minutes, and put the message back on the queue
    opts.OnException<MakeBelieveSubsystemIsDownException>()
        .PauseThenRequeue(5.Minutes());
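
To make that concrete, here’s a rough sketch of what such a handler could look like. The message type, handler, and endpoint URL here are all hypothetical (they’re not part of the help desk sample), but the point is that the only place we ever touch the flaky downstream service is inside a message handler, and that handler translates “the service is unreachable” into the exception type the policy above knows how to react to:

    using System.Net.Http.Json;

    // Hypothetical command that triggers the call to the downstream service
    public record NotifyExternalSystem(Guid IncidentId);

    public class MakeBelieveSubsystemIsDownException : Exception
    {
        public MakeBelieveSubsystemIsDownException(string message, Exception inner)
            : base(message, inner) { }
    }

    public static class NotifyExternalSystemHandler
    {
        // Wolverine can inject registered services as additional arguments to
        // Handle(), assuming HttpClient is registered with the container
        public static async Task Handle(NotifyExternalSystem command, HttpClient client)
        {
            HttpResponseMessage response;
            try
            {
                response = await client.PostAsJsonAsync("/api/incidents", command);
            }
            catch (HttpRequestException ex)
            {
                // A connection refused or DNS failure here means the service is
                // down, so translate that into the exception type our error
                // policy pauses and requeues on
                throw new MakeBelieveSubsystemIsDownException(
                    "The downstream incident service is unreachable", ex);
            }

            // Anything else (a 4xx/5xx response) just throws and falls through
            // to whatever other error policies apply
            response.EnsureSuccessStatusCode();
        }
    }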

And finally, Wolverine also has circuit breaker functionality to shut down processing on a queue if there are too many errors within a certain window of time. This feature certainly applies to messages coming in from external message brokers like Rabbit MQ, Azure Service Bus, or AWS SQS, but it can also apply to database-backed local queues. For the help desk system, I’m going to add a circuit breaker to the local queue that processes the TryAssignPriority command, pausing that queue’s processing on the current node if too high a percentage of messages is failing:

    opts.LocalQueueFor<TryAssignPriority>()
        // By default, local queues allow for parallel processing with a maximum
        // parallel count equal to the number of processors on the executing
        // machine, but you can override the queue to be sequential and single file
        .Sequential()

        // Or add more to the maximum parallel count!
        .MaximumParallelMessages(10)

        // Pause processing on this local queue for 1 minute if there's
        // more than 20% failures for a period of 2 minutes
        .CircuitBreaker(cb =>
        {
            cb.PauseTime = 1.Minutes();
            cb.SamplingPeriod = 2.Minutes();
            cb.FailurePercentageThreshold = 20;
            
            // Definitely worry about this type of exception
            cb.Include<TimeoutException>();
            
            // Don't worry about this type of exception
            cb.Exclude<InvalidInputThatCouldNeverBeProcessedException>();
        });

And don’t worry, Wolverine won’t lose any additional messages published to that queue. They’ll just sit in the database until the current node picks back up on this local queue or another running node is able to steal the work from the database and continue.
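
That behavior does depend on the local queues being durable, meaning they’re backed by the PostgreSQL message storage rather than held purely in memory. We opted into Wolverine’s durable messaging earlier in the series, and as a reminder, making every local queue durable is roughly this one-liner inside UseWolverine():

    // Back all local queues with the durable message storage so queued
    // messages survive process restarts and can be picked up by other nodes
    opts.Policies.UseDurableLocalQueues();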

Summary and What’s Next

I only gave some highlights here; Wolverine has quite a few more error handling capabilities. These policies are probably something you adapt over time as you learn more about how your system and its dependencies behave. Throwing more descriptive exceptions from your own code also makes these kinds of error handling policies more effective.

I’m almost done with this series. I think the next post or two (and they won’t come until next week) will be all about logging, auditing, metrics, and OpenTelemetry integration.
