A couple weeks ago I wrote a blog post on the new “Async Daemon” feature in Marten. This post is a bit that I cut out of that post just describing the challenges I faced and what I did to slide around the problems. For all Marten users that have been asking me about writing their own subsystem to read and process events offline, you really want to read this post to understand why that’s much harder than you’d think and why you do probably want to just help make the async daemon solid.
The first challenge for the async daemon was “knowing” when there are new events that need to be processed by async projections. When a projection runs, it needs to process the events in the same order that they were captured in. Since the async daemon was inevitably going to use some sort of polling (NOTIFY/LISTEN in Postgresql was not adequate by itself) to read events out of the event table, we needed a very efficient way to be able to page the event fetching without missing events.
We started Marten with the thought that we would try to accomplish that by having the event store enqueue the events in a rolling buffer table that some kind of offline process would poll and read, but we were talked out of that approach in discussions with a Postgresql consultant who was helping us at work. Moreover, as I worked through other use cases to rebuild projections from scratch or add new projections later, we realized that the rolling buffer table would never have worked for the async daemon.
We also experimented with using sequential Guid’s as the global identifier for events in the event store with the idea that we would be able to use that to key off of for the projections by always querying for “Id > [last event id encountered].” In my testing I was unable to get the sequential Guid algorithm to accurately order the event id’s, especially under a heavy parallel load.
In the end, we opted to make the event store table in Marten use a sequential long integer as its primary key, and backed that with a database SEQUENCE. That gave us a more reliable way to “know” what events were new for each individual projection. In testing I figured out pretty quickly that the async daemon was missing events when there’s a lot of concurrent events streaming in because of event sequence id’s being reserved from in flight transactions. To counteract that problem, I ended up taking a two step process:
- Limit the async daemon to only querying against events that were captured before some time of threshold (3 seconds is the default) to avoid missing events that are still in flight
- When the async daemon fetches a new page of events, it actually tries to check that there are no gaps in the event sequence, and if there is, it pauses a little bit, and tries again until there are no gaps in the sequence or if the subsequent fetch turns up the exact same data (leading the async daemon to believe that the missing events were rejected).
Those two steps — as far as I can tell — have eliminated the problems I was seeing before about missing events in flight. It did completely ruin a family dinner at our favorite Thai restaurant when I couldn’t make myself stop thinking about how to slide around the problems in event ordering;)
The other killer problem was in trying to make the async daemon resilient in the face of potential connectivity problems and occasional projection failures without losing any results. I’ll try to blog about that in a later post.