Optimizing Marten Part 2

This is an update to an earlier blog post on optimizing for performance in Marten. Marten is a new OSS project I’m working on that allows .Net applications to treat the Postgresql database as a document database. Our hope at work is that Marten will be a more performant, easier-to-support replacement for RavenDb in our ecosystem (and possibly a replacement event store mechanism inside of the applications that use event sourcing, but that’s going to come later).

Before we think about using Marten for real, we’re making an effort to optimize its performance, both in reading data from and writing data to Postgresql.

Optimizing Queries with Indexes

In my previous post, my former colleague Joshua Flanagan suggested using the Postgresql containment operator and gin indexes as part of my performance comparisons. To support that, I added the ability to define database indexes for Marten document types like this:

public class ContainmentOperator : MartenRegistry
{
    public ContainmentOperator()
    {
        // For persisting a document type called 'Target'
        For<Target>()

            // Use a gin index against the json data field
            .GinIndexJsonData()

            // directs Marten to try to use the containment
            // operator for querying against this document type
            // in the Linq support
            .PropertySearching(PropertySearching.ContainmentOperator);
    }
}

and like this for indexing what we’re calling “searchable” fields, where Marten duplicates some element of a document into a separate database column for optimized searching:

public class DateIsSearchable : MartenRegistry
{
    public DateIsSearchable()
    {
        // This can also be done with attributes
        // This automatically adds a "BTree" index
        For<Target>().Searchable(x => x.Date);
    }
}

As of now, when you choose to make a field or property of a document “searchable”, Marten automatically adds a database index to that column on the document storage table. By default, the index is a standard Postgresql btree index, but you do have the ability to override how the index is created.
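For reference, wiring one of these registries into the document store looks something like the sketch below. The connection string is just a placeholder, and the Schema.Include call reflects the configuration API as it stands right now, so treat this as illustrative rather than gospel:

using Marten;

public class SettingUpTheStore
{
    public IDocumentStore Bootstrap()
    {
        return DocumentStore.For(options =>
        {
            // Placeholder connection string for the example
            options.Connection("host=localhost;database=marten_testing;username=postgres");

            // Pick up the index and searching rules from the registry above
            options.Schema.Include<DateIsSearchable>();
        });
    }
}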

Now that we have support for querying using the containment operator and support for defining indexes, I reran the query performance tests and updated the results with some new data:

Serializer: JsonNetSerializer

Query Type                        | 1K  | 10K  | 100K
JSON Locator Only                 | 7   | 77.2 | 842.4
jsonb_to_record + lateral join    | 9.4 | 88.6 | 1170.4
searching by duplicated field     | 1   | 16.4 | 135.4
searching by containment operator | 4.6 | 14.8 | 132.4

Serializer: JilSerializer

Query Type                        | 1K  | 10K  | 100K
JSON Locator Only                 | 6   | 54.8 | 827.8
jsonb_to_record + lateral join    | 8.6 | 76.2 | 1064.2
searching by duplicated field     | 1   | 6.8  | 64
searching by containment operator | 4   | 7.8  | 66.8

Again, searching by a field that is duplicated as a simple database column with a btree index is clearly the fastest approach. The containment operator plus a gin index comes in second, and may be the best choice when you have to issue many different kinds of queries against the same document type. Based on this data, I think we’re going to make the containment operator the preferred way of querying json documents, but fall back to the json locator approach for every query operator other than equality tests.
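To make that split concrete, here’s a rough sketch of how it could look from the Linq side. The SQL in the comments is only approximate, and exactly which operators get the containment treatment may change as we flesh this out:

using System;
using System.Linq;
using Marten;

public class QueryingTargets
{
    public void Queries(IQuerySession session)
    {
        // An equality test is a candidate for the containment operator,
        // roughly: where data @> '{"Date": "..."}'::jsonb
        var exactMatches = session.Query<Target>()
            .Where(x => x.Date == DateTime.Today)
            .ToList();

        // Any other operator would fall back to a json locator,
        // roughly: where CAST(data ->> 'Date' as timestamp) > :arg
        var laterDocs = session.Query<Target>()
            .Where(x => x.Date > DateTime.Today)
            .ToList();
    }
}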

I still think that we have to ship with Newtonsoft.Json as our default json serializer because of F# and polymorphism concerns among other things, but if you can get away with it for your document types, Jil is clearly much faster.
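Swapping serializers is a one line change in the store configuration, assuming the JilSerializer used in these tests plugs in through Marten’s ISerializer contract, something like:

using Marten;

public class SerializerChoice
{
    public IDocumentStore Bootstrap()
    {
        return DocumentStore.For(options =>
        {
            options.Connection("host=localhost;database=marten_testing;username=postgres");

            // Newtonsoft.Json is the default; opt into Jil per document store
            options.Serializer(new JilSerializer());
        });
    }
}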

There is some conversation in the Marten Gitter room about possibly adding gin indexes to every document type by default, but I think we first need to pay attention to the data in the next section:


Insert Timings

The querying is definitely important, but we certainly want the write side of Marten to be fast too. We’ve had what we call “BulkInsert” support using Npgsql & Postgresql’s facility for bulk copying. Recently, I changed Marten’s internal unit of work class to issue all of its delete and “upsert” commands in one single ADO.Net DbCommand so that multiple sql statements execute in a single network round trip.

My best friend the Oracle database guru (I’ll know if he reads this because he’ll be groaning about the Oracle part;)) suggested that this approach might not matter much compared to issuing multiple ADO.Net commands against the same stateful transaction and connection, but we were both surprised by how much difference batching the SQL commands turned out to make.
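To show what the batching buys us, here’s a bare Npgsql illustration of the difference between a command per document and one concatenated command. This is not Marten’s actual unit of work code, and upsert_doc is a made-up function name standing in for the real generated upsert:

using System;
using System.Text;
using Npgsql;
using NpgsqlTypes;

public static class BatchingSketch
{
    // Original approach: one command, and one network round trip, per document
    public static void CommandPerDocument(NpgsqlConnection conn, string[] jsonDocs)
    {
        foreach (var json in jsonDocs)
        {
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "select upsert_doc(@id, @doc)";
                cmd.Parameters.AddWithValue("id", Guid.NewGuid());
                cmd.Parameters.AddWithValue("doc", NpgsqlDbType.Jsonb, json);
                cmd.ExecuteNonQuery();
            }
        }
    }

    // Batched approach: every statement goes into a single command so the
    // whole unit of work travels over the wire in one round trip
    public static void BatchedCommand(NpgsqlConnection conn, string[] jsonDocs)
    {
        using (var cmd = conn.CreateCommand())
        {
            var sql = new StringBuilder();
            for (var i = 0; i < jsonDocs.Length; i++)
            {
                sql.AppendLine($"select upsert_doc(@id{i}, @doc{i});");
                cmd.Parameters.AddWithValue("id" + i, Guid.NewGuid());
                cmd.Parameters.AddWithValue("doc" + i, NpgsqlDbType.Jsonb, jsonDocs[i]);
            }

            cmd.CommandText = sql.ToString();
            cmd.ExecuteNonQuery();
        }
    }
}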

To better understand the impact on insert timing using our bulk insert facility, the new batched update mechanism, and the original “ADO.Net command per document update” approach, I ran a series of tests that tried to insert 500 documents using each technique.

Because we also need to understand the implications on insertion and update timing of using the searchable, duplicated fields and gin indexes (there is some literature in the Postgresql docs stating that gin indexes could be expensive on the write side), I ran each permutation of update strategy against three different indexing strategies on the document storage table:

  1. No indexes whatsoever
  2. A duplicated field with a btree index
  3. Using a gin index against the JSON data column

And again, just for fun, I used both the Newtonsoft.Json and Jil serializers to also understand the impact that they have on performance.

You can find the code I used to make these tables in GitHub in the insert_timing class.

Using Newtonsoft.Json as the Serializer

Index                     | Bulk Insert | Batch Update | Command per Document
No Index                  | 62          | 149          | 244
Duplicated Field w/ Index | 53          | 152          | 254
Gin Index on Json         | 96          | 186          | 300


Using Jil as the Serializer

Index                     | Bulk Insert | Batch Update | Command per Document
No Index                  | 47          | 134          | 224
Duplicated Field w/ Index | 57          | 151          | 245
Gin Index on Json         | 79          | 180          | 270

As you can clearly see, the new batch update mechanism looks to be a pretty big win for performance over our original, naive “command per document” approach. The only downside is that this technique has a ceiling on how many documents, or how large they can be, before the single command exceeds technical limits. For right now, I think I’d like to simply beat that problem with documentation that pushes users toward the bulk insert mechanism for large data sets. In the longer term, we’ll throttle the batch update by paging it into some to-be-determined number of document updates at a time.
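The bulk insert mechanism is an easy recommendation to make because it’s a single call on the document store, roughly like this (the documents array here stands in for whatever large data set you’re loading):

using Marten;

public class BulkLoading
{
    public void LoadLargeDataSet(IDocumentStore store, Target[] documents)
    {
        // One call pushes the whole document set through Postgresql's
        // bulk copy facility instead of the unit of work
        store.BulkInsert(documents);
    }
}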

The key takeaway for me just reinforces the very first lesson I had drilled into me about software performance: network round trips are evil. We are certainly reducing the number of network round trips between our application and the database server by utilizing the command batching.

You can also see that using a gin index slows down document updates considerably. I think the only good answer for users is that they’ll have to do their own performance testing, as always.


Other Optimization Things

  • We’ve been able to cut down on Reflection hits and dynamic runtime behavior by using Roslyn as a crude metaprogramming mechanism to just codegen the document storage code.
  • Again in the theme of reducing network round trips, we’re going to investigate being able to batch up deferred queries into a single request to the Postgresql database.
  • We’re not sure about the details yet, but we’ll be investigating approaches for using asynchronous projections inside of Postgresql (maybe using Javascript running inside of the database, maybe .Net code in an external system, maybe both approaches).
  • I’m leaving the issues out on http://up-for-grabs.net, but we’ll definitely add the ability to just retrieve the raw JSON so that HTTP endpoints can stream data to clients without taking the unnecessary hit of deserializing to a .Net type just to immediately serialize right back to JSON for the HTTP response. We’ll also support a completely asynchronous querying and update API for maximum scalability.

