Stream Compacting in Marten 8.0

One of the earliest lessons I learned designing software systems is that reining in the unchecked growth of databases through judicious pruning and archiving can do wonders for system performance over time. As yet another tool in the toolbox for scaling Marten, and in collaboration with a JasperFx Software customer, we’re adding an important feature in Marten 8.0 called “Stream Compacting” that can be used to shrink Marten’s event storage and keep the database a little more limber as old data becomes irrelevant.

Let’s say that you failed to be omniscient in your event stream modeling and ended up with longer streams of events than you’d ideally like, and that bloat is growing your database size and maybe hurting performance. Maybe you don’t really care about all the old events and just want to keep the current projected state and the more recent events. And maybe you’d like to move the old events into some kind of “cold” storage like an S3 bucket or [something to be determined later].

Enter the new “Stream Compacting” feature coming in Marten 8.0 next week, used like so:

public static async Task compact(IDocumentSession session, Guid equipmentId, IEventsArchiver archiver)
{
    // Maybe we have ceased to care about old movements of a piece of equipment
    // But we want to retain an accurate positioning over the past year
    // Yes, maybe we should have done a "closing the books" pattern, but we didn't
    // So instead, let's just "compact" the stream

    await session.Events.CompactStreamAsync<Equipment>(equipmentId, x =>
    {
        // We could say "compact" all events for this stream
        // from version 1000 and below
        x.Version = 1000;

        // Or instead say, "compact all events older than 30 days ago":
        x.Timestamp = DateTimeOffset.UtcNow.Subtract(30.Days());

        // Carry out some kind of user defined archiving process to
        // "move" the about to be archived events to something like an S3 bucket
        // or an Azure Blob or even just to another table
        x.Archiver = archiver;

        // Pass in a cancellation token because this might take a bit...
        x.CancellationToken = CancellationToken.None;
    });
}
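The archiver above is whatever you choose to plug in. Marten’s actual IEventsArchiver contract isn’t shown in this post, so here’s a rough, self-contained sketch of what a user-defined “cold storage” archiver could look like, with a hypothetical event shape and interface standing in for the real ones:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-ins: the real IEventsArchiver contract and event
// envelope in Marten may look different. This just illustrates the idea.
public record ArchivedEvent(Guid StreamId, long Version, string EventTypeName, string Body);

public interface IColdStorageArchiver
{
    Task ArchiveAsync(IReadOnlyList<ArchivedEvent> events, CancellationToken token);
}

// Appends archived events to a local JSON Lines file. A production
// archiver would more likely push to S3, Azure Blob, or another table.
public class JsonLinesFileArchiver : IColdStorageArchiver
{
    private readonly string _path;
    public JsonLinesFileArchiver(string path) => _path = path;

    public async Task ArchiveAsync(IReadOnlyList<ArchivedEvent> events, CancellationToken token)
    {
        var lines = events.Select(e => JsonSerializer.Serialize(e));
        await File.AppendAllLinesAsync(_path, lines, token);
    }
}
```

Since the archiver only sees the events that are about to be compacted away, something this simple is enough to keep an audit trail outside the “hot” event storage.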

What this “compacting” does is effectively create a snapshot of the stream state (the Equipment type in the example above) and replace the existing events being archived in the database with a single Compacted<Equipment> event of this shape:

// Right now we're just "compacting" in place, but there's some
// thought to extending this to what one of our contributors
// calls "re-streaming" in their system where they write out an
// all new stream that just starts with a summary
public record Compacted<T>(T Snapshot, Guid PreviousStreamId, string PreviousStreamKey);

The latest, greatest Marten projection bits are always able to restart any SingleStreamProjection with the Snapshot data of a Compacted<T> event, with no additional coding on your part.
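To make that restart behavior concrete, here’s an illustrative fold. This is not Marten’s internals, and the Equipment and Moved event types are made up; the point is only that a Compacted<Equipment> event resets the accumulated state to its snapshot and later events apply on top of it:

```csharp
using System;
using System.Collections.Generic;

// Made-up domain types for illustration only.
public record Equipment(string Position);
public record Moved(string To);
public record Compacted<T>(T Snapshot, Guid PreviousStreamId, string PreviousStreamKey);

public static class LiveAggregator
{
    // Fold the events left to right. A Compacted<Equipment> event simply
    // replaces the accumulated state with its snapshot, so the projection
    // never needs the events that were compacted away.
    public static Equipment Aggregate(IEnumerable<object> events)
    {
        var state = new Equipment("unknown");
        foreach (var e in events)
        {
            state = e switch
            {
                Compacted<Equipment> c => c.Snapshot,
                Moved m => state with { Position = m.To },
                _ => state
            };
        }
        return state;
    }
}
```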

And now, to answer a few questions that my client (Carsten, this one’s for you, sorry I was slow today :)) asked about this feature:

  • Is there going to be a default archiver? Not yet, but I’m all ears on what that could or should be. It’ll always be pluggable, of course, because I’d expect a wide range of usages.
  • How about async projections? This will not impact asynchronous projections that are already in flight. The underlying mechanism is not using any persisted, projected document state but is instead fetching the raw events and effectively doing a live aggregation to come back to the compacted version of the projected document.
  • Can you compact a single stream multiple times? Yes. I’m thinking folks could use a projection “side effect” to emit a request message to compact a stream every 1,000 events or some other number.
  • What happens in case the async daemon moves beyond (e.g. new events were saved while the compacting is ongoing) – will the compacting aggregation overwrite the projection updates done by the async daemon – basically the same for inline projections? The compacting will be done underneath the async daemon, but will not impact the daemon functionality. The projections are “smart enough” to restart the snapshot state from any Compacted<T> event found in the middle of the current events anyway.
  • How does rewind and replay work if a stream is compacted? Um, you would only be able to replay at or after the point of compacting. But we can talk about making this able to recover old events from the archive in a later phase!
  • Any other limitations? Yeah, the same problem we ran into with the “optimized rebuild” feature from Marten 7.0. This will not play well if there is more than one single-stream projection view for the same type of stream. Not insurmountable, but definitely not convenient. I think you’d have to explicitly handle a Compacted<T1> event in the projection for T2 if both T1 and T2 are separate views of the same stream type.
  • Why do I care? You probably don’t upfront, but this might easily be a way to improve the performance and scalability of a busy system over time as the database grows.
  • Is this a replacement or alternative to the event archival partitioning from Marten 7? You know, I’m not entirely sure, and I think your usage may vary. But if your database is likely to grow massively large over time and you can benefit from shrinking the “hot” part of the database by moving out events you no longer care about, use at least one of these options, or even both!
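On the “compact every 1,000 events” idea from the Q&A above, the trigger itself could be as simple as a version check. The CompactStreamRequest message and its wiring here are entirely hypothetical; in practice something like it might be emitted as a projection side effect and handled by code that calls CompactStreamAsync:

```csharp
using System;

// Hypothetical request message that a handler could consume to kick off
// compaction for the stream in question.
public record CompactStreamRequest(Guid StreamId, long UpToVersion);

public static class CompactionTrigger
{
    // Request compaction each time the stream crosses another multiple
    // of the interval (versions 1000, 2000, 3000, ...).
    public static CompactStreamRequest? MaybeRequest(Guid streamId, long version, long interval = 1000)
        => version % interval == 0
            ? new CompactStreamRequest(streamId, version)
            : null;
}
```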

Summary

The widespread advice from event sourcing experts is to “keep your streams short,” but I partially suspect that advice is driven by technical limitations of some of the commonly used early commercial event store tools. I also believe that Marten is less impacted by long streams than many other event store tools, but still, smaller databases will probably outperform bigger ones in most cases.
