Observability in Wolverine

Some colleagues and I are working on the technical modernization of a large system at work, and lately we’ve been talking quite a bit about the observability we want baked into the system going forward. To share a little bit of context about this particular system, it’s…

  • Very distributed, with asynchronous messaging between a dozen or so services plus calls to another half dozen external web services
  • Multi-tenanted
  • Performance and throughput need to be carefully monitored, and we absolutely have to be concerned with customer-specific SLAs

And of course, we’re also dogfooding Wolverine as our messaging framework and local mediator as part of our modernization work. One of my specific goals for Wolverine’s usability in grown-up development is to provide all the necessary observability (error logging, activity logging, performance metrics) for production monitoring and troubleshooting, and to bake that so completely into Wolverine itself that developers don’t have to think too much about it and certainly don’t have to write much repetitive boilerplate code in their daily work to get there.

To start a conversation about what Wolverine can already do, check out this partial application bootstrapping code from a demonstrator project I wrote to test out Wolverine’s Open Telemetry and metrics collection exports — in this case using Honeycomb to visualize and analyze the exported Open Telemetry data:

var host = Host.CreateDefaultBuilder(args)
    .UseWolverine((context, opts) =>
    {
        opts.ServiceName = "Metrics";

        // Open Telemetry *should* cover this anyway, but
        // if you want Wolverine to log a message for *beginning*
        // to execute a message, try this
        opts.Policies.LogMessageStarting(LogLevel.Debug);
        
        // For both Open Telemetry span tracing and the "log message starting..."
        // option above, add the AccountId as a tag for any command that implements
        // the IAccountCommand interface
        opts.Policies.Audit<IAccountCommand>(x => x.AccountId);
        
        // Setting up metrics and Open Telemetry activity tracing
        // to Honeycomb
        var honeycombOptions = context.Configuration.GetHoneycombOptions();
        honeycombOptions.MetricsDataset = "Wolverine:Metrics";
        
        opts.Services.AddOpenTelemetry()
            // enable metrics
            .WithMetrics(x =>
            {
                // Export metrics to Honeycomb
                x.AddHoneycomb(honeycombOptions);
            })
            
            // enable Otel span tracing
            .WithTracing(x =>
            {
                x.AddHoneycomb(honeycombOptions);
                x.AddSource("Wolverine");
            });

    })
    .UseResourceSetupOnStartup()
    .Build();

await host.RunAsync();

I’ve opted into a few optional things up above to export the Wolverine Open Telemetry tracing and metrics to Honeycomb. I’ve also opted to have Wolverine inject extra logging into the generated message handlers so that a Debug-level message is written at the start of message processing.

Open Telemetry activities are automatically recorded for all messages sent, received, and executed. Significant events within Wolverine, like message execution success, message failure exceptions, and messages moving to dead letter queues, are logged through .NET’s standard ILogger abstraction with structured logging in mind. In all cases, if a log message relates to a specific message, that message’s correlation identifier, which points back to the Open Telemetry spans, is written into the log entry. In a future release, Wolverine will make a larger investment in Open Telemetry and utilize Activity Events in addition to old-fashioned logging.
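
If you aren’t a Honeycomb user, the same activity data can be captured by any backend that speaks Open Telemetry. Here’s a minimal sketch of exporting Wolverine’s activity source through the generic OTLP exporter instead; the collector endpoint below is just a placeholder assumption:

opts.Services.AddOpenTelemetry()
    .WithTracing(x =>
    {
        // "Wolverine" is the same activity source name registered in the
        // bootstrapping sample above
        x.AddSource("Wolverine");

        // From the OpenTelemetry.Exporter.OpenTelemetryProtocol package.
        // Point this at whatever OTLP collector you actually run
        x.AddOtlpExporter(otlp => otlp.Endpoint = new Uri("http://localhost:4317"));
    });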

Wolverine Metrics

Wolverine automatically tracks several performance-related metrics through the System.Diagnostics.Metrics types, which sets Wolverine users up to export their system’s performance metrics to third-party observability tools like Honeycomb or Datadog that support Open Telemetry metrics. The current set of metrics in Wolverine is shown below:

Metric Name                   | Metric Type      | Description
wolverine-messages-sent       | Counter          | Number of messages sent
wolverine-execution-time      | Histogram        | Execution time in milliseconds
wolverine-messages-succeeded  | Counter          | Number of messages successfully processed
wolverine-dead-letter-queue   | Counter          | Number of messages moved to dead letter queues
wolverine-effective-time      | Histogram        | Effective time in milliseconds between a message being sent and being completely handled. Right now this works for Wolverine-to-Wolverine sending, and for NServiceBus applications sending to Wolverine applications through Wolverine’s NServiceBus interoperability.
wolverine-execution-failure   | Counter          | Number of message execution failures, tagged by exception type
wolverine-inbox-count         | Observable Gauge | Samples the number of pending envelopes in the durable inbox (likely to change)
wolverine-outbox-count        | Observable Gauge | Samples the number of pending envelopes in the durable outbox (likely to change)
wolverine-scheduled-count     | Observable Gauge | Samples the number of pending scheduled envelopes in the durable inbox (likely to change)

In all cases, the metrics are tagged by message type. In the case of messages sent, succeeded, or failed, Wolverine also tags the metrics by the message destination (RabbitMQ, Azure Service Bus, or AWS SQS queue, etc.).
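
If you just want to spot check these measurements locally before wiring up a real exporter, you can subscribe to the instruments directly with the BCL’s MeterListener. This is only a rough sketch using the metric names from the table above, and it assumes the counters record long values; register int or double callbacks the same way if your measurements come through as those types:

using System.Diagnostics.Metrics;

// Rough local sketch: print Wolverine measurements and their tag count
// to the console for spot checking before exporting to a real tool
var listener = new MeterListener();

listener.InstrumentPublished = (instrument, l) =>
{
    // Every Wolverine instrument name in the table above starts with "wolverine-"
    if (instrument.Name.StartsWith("wolverine-"))
    {
        l.EnableMeasurementEvents(instrument);
    }
};

// Assuming long-valued measurements here; the histograms may record doubles,
// so add a matching SetMeasurementEventCallback<double> as well if needed
listener.SetMeasurementEventCallback<long>((instrument, measurement, tags, state) =>
{
    Console.WriteLine($"{instrument.Name}: {measurement} ({tags.Length} tags)");
});

listener.Start();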

In addition, you can add arbitrary tags to the metrics. Taking an example inspired by something I know we’re going to want at my own company, let’s say that we want to tag the performance metrics by the business organization related to each message so that we can break down system throughput and performance by organization.

First off, let’s say that we have an interface type like this that we can use to let Wolverine know that a message is related to a particular business organization:

public interface IOrganizationRelated
{
    string OrganizationCode { get; }
}
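
For illustration only, a hypothetical command that opts into this tagging could be as simple as:

// Hypothetical example command -- any message implementing IOrganizationRelated
// will pick up the extra metrics tag from the middleware below
public record SendInvoice(string OrganizationCode, string InvoiceId) : IOrganizationRelated;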

Next, I’ll write a simple Wolverine middleware type to add a metrics tag to the metric data collection:

public static class OrganizationTaggingMiddleware
{
    public static void Before(IOrganizationRelated command, Envelope envelope)
    {
        envelope.SetMetricsTag("org.code", command.OrganizationCode);
    }
}

And finally, we add that middleware to our system for all handlers where the message type implements IOrganizationRelated:

using var host = await Host.CreateDefaultBuilder()
    .UseWolverine(opts =>
    {
        // Add this middleware to all handlers where the message can be cast to
        // IOrganizationRelated
        opts.Policies.AddMiddlewareByMessageType(typeof(OrganizationTaggingMiddleware));
    }).StartAsync();
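
The payoff is that the actual message handler stays completely free of metrics code. A hypothetical handler for the SendInvoice example above needs nothing beyond its business logic:

public static class SendInvoiceHandler
{
    // No metrics or logging code here -- the OrganizationTaggingMiddleware and
    // Wolverine's built-in instrumentation take care of all of that
    public static void Handle(SendInvoice command)
    {
        // business logic only...
    }
}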

To get Wolverine ready for a more formal, built-in way to handle multi-tenancy, a recent version introduced formal TenantId tracking on Wolverine messages (this will be improved in Wolverine 0.9.16). The TenantId, if it exists, will also be tagged into the metrics (and Open Telemetry activity/span tracking) as “tenant.id”.

Here’s a possible usage of this:

public static async Task publish_operation(IMessageBus bus, string tenantId, string name)
{
    // All outgoing messages or executed messages from this
    // IMessageBus object will be tagged with the tenant id
    bus.TenantId = tenantId;
    await bus.PublishAsync(new SomeMessage(name));
}

As of the forthcoming 0.9.16 release, the TenantId value will be automatically propagated to any messages sent in response to an original message that was tagged with a TenantId value.
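
In practice that means a handler can publish follow-up messages without ever touching the tenant id itself. As a sketch (FollowUpMessage is a made-up type here, and I’m assuming SomeMessage exposes the name it was constructed with), a cascading message returned from a handler would carry the original TenantId along automatically:

public static class SomeMessageHandler
{
    // Returning a message from a handler makes it a Wolverine "cascading" message.
    // As of 0.9.16, the TenantId on the incoming SomeMessage envelope is propagated
    // to FollowUpMessage with no explicit code here
    public static FollowUpMessage Handle(SomeMessage message)
    {
        return new FollowUpMessage(message.Name);
    }
}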

Some More Thoughts on Open Telemetry

I personally want to make a big bet on Open Telemetry tracing and monitoring going forward. As much as possible, I want us to use out-of-the-box tools to integrate Open Telemetry tracking for performance monitoring of operations like outgoing web service calls (through integration with HttpClient) or Wolverine’s own tracking.
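
As a concrete example, the HttpClient side takes nothing more than the existing instrumentation package. A minimal sketch, assuming the OpenTelemetry.Instrumentation.Http package is referenced, would be to register that instrumentation right next to Wolverine’s activity source:

opts.Services.AddOpenTelemetry()
    .WithTracing(x =>
    {
        // Wolverine's own spans, as in the bootstrapping sample earlier
        x.AddSource("Wolverine");

        // Spans for every outgoing HttpClient call, courtesy of the
        // OpenTelemetry.Instrumentation.Http package
        x.AddHttpClientInstrumentation();

        x.AddHoneycomb(honeycombOptions);
    });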

Moreover, we currently have a great deal of repetitive code supporting our robust logging strategy, and while having effectively instrumented code is certainly valuable, I find that those coding requirements detract from the readability of the code and often act as a deterrent against evolving the system. I’d like to get to the point where our developers spend very little time explicitly writing instrumentation code, and where the actual functional code is easier to read and reason about because the noise code is gone.
