Optimizing for Performance in Marten

For the last couple weeks I’ve been working on a new project called Marten that is meant to exploit Postgresql’s JSONB data as a full fledged document database for .Net development as a drop in replacement for RavenDb in our production environment. I think that I would say that our primary goal with Marten is improved stability and supportability, but maximizing performance and throughput is a very close second in the priority list.

This is my second update on Marten progress. From last week, also see Marten Development So Far.

So far, I’ve mostly been focusing on optimizing the SQL queries generated by the Linq support for faster fetching. I’ve been experimenting with a few different query modes for the SQL generation based on what fields or properties you’re trying to search on:

  1. By default in the absence of any explicit configuration, Marten tries to use the “jsonb_to_record” function with a LATERAL join approach to optimize queries against members on the root of the document.
  2. You can also force Marten to only use basic Postgresql JSON locators to generate the where clauses in the SQL statements
  3. Finally, if you know that your application will be frequently querying a document type against a certain member, Marten can use a “searchable” field such that it duplicates that data in a normal database field and searches directly against that database field. This mechanism will clearly slow down your inserts and take up somewhat more storage space, but the numbers I’m about to display don’t lie, this is very clearly the fastest way to optimize queries using Marten (so far).

I’ve also experimented with both the Newtonsoft.Json serializer and the faster, but less flexible Jil serializer. Again, the numbers are pretty clear that for bigger result sets, Jil is much faster (NetJSON was a complete bust for me when I tried it). So far I’ve been able to keep Marten serializer-agnostic and I can easily see times when you’d have to opt for Newtonsoft’s flexibility.

Default jsonb_to_record/LATERAL JOIN

Using this approach, the SQL generated is:

select d.data from mt_doc_target as d, LATERAL jsonb_to_record(d.data) as l("Date" date) where l."Date" = :arg0

Json Locators Only

While you can configure this behavior on a field by field basis, the quickest way is to just set the default document behavior:

public class JsonLocatorOnly : MartenRegistry
{
    public JsonLocatorOnly()
    {
        // This can also be done with attributes
        For<Target>().PropertySearching(PropertySearching.JSON_Locator_Only);
    }
}

With this setting, the generated SQL is:

select d.data from mt_doc_target as d where CAST(d.data ->> 'Date' as date) = :arg0

Searchable, Duplicated Field

Again, to configure this option, I used this code:

public class DateIsSearchable : MartenRegistry
{
    public DateIsSearchable()
    {
        // This can also be done with attributes
        For<Target>().Searchable(x => x.Date);
    }
}

When I do this, the table for the Target type has an additional field called “date” that will get the value of the Target.Date property every time a Target object is inserted or updated in the database.

The resulting SQL is:

select d.data from mt_doc_target as d where d.date = :arg0

The Performance Results

I created the table below by generating randomized data, then trying to search by a DateTime field using three different mechanisms:

var theDate = DateTime.Today.AddDays(3);
var queryable = session.Query<Target>().Where(x => x.Date == theDate);

In all cases, I used the same sample data for the document count and took an average of running the same query five times after throwing out an initial attempt where Postgresql seemed to be “warming up” the JSONB data.

Serializer: JsonNetSerializer

Query Type 1K 10K 100K 1M
JSON Locator Only 9.6 75.2 691.2 9648
jsonb_to_record + lateral join 10 93.6 922.6 12091.2
searching by duplicated field 2.4 15 169.6 2777.8

Serializer: JilSerializer

Query Type 1K 10K 100K 1M
JSON Locator Only 6.8 61 594.8 7265.6
jsonb_to_record + lateral join 8.4 86.6 784.2 9655.8
searching by duplicated field 1 8.8 115.4 2234.2

To be honest, I expected the JSONB_TO_RECORD + LATERAL JOIN mechanism to be faster than the JSON locator only approach, but I need to go back and try to add some indexes because that’s supposed to be the benefit of using JSONB_TO_RECORD to avoid the object casts that inevitably defeat indexes. I’d be happy to get some Postgresql gurus to weigh in here if there are any reading this.

If you’re curious to see my mechanism for recording this data, see the performance_tuning code file in GitHub.

Bulk Loading Documents

From time to time (testing or data migrations maybe) you’ll have some need to very rapidly load a large set of documents into your database. I added a feature this morning to Marten that exploits Postgresql’s COPY feature supported by Npgsql:

public void load_with_small_batch()
{
    // This is just creating some randomized
    // document data
    var data = Target.GenerateRandomData(100).ToArray();

    // Load all of these into a Marten-ized database
    theSession.BulkLoad(data);

    // And just checking that the data is actually there;)
    theSession.Query<Target>().Count().ShouldBe(data.Length);
    theSession.Load<Target>(data[0].Id).ShouldNotBeNull();
}

Behind the scenes, Marten is using code generation at runtime and compiled by Roslyn to do the bulk loading as efficiently as possible without any hit from using reflection:

public void Load(ISerializer serializer, NpgsqlConnection conn, IEnumerable documents)
{
    using (var writer = conn.BeginBinaryImport("COPY mt_doc_target(id, data) FROM STDIN BINARY"))
    {
        foreach (var x in documents)
        {
            writer.StartRow();
            writer.Write(x.Id, NpgsqlDbType.Uuid);
            writer.Write(serializer.ToJson(x), NpgsqlDbType.Jsonb);
        }
    }
}

Do note that the code generation mechanism is smart enough to also add any fields or properties of the document type that are marked as duplicated for searching.

Other Outstanding Optimization Tasks 

  • Optimize the mechanics for applying all the changes in a unit of work. I’m hoping that we can do something to reduce the number of network round trips between the application and the postgresql server. My fallback approach is going to be to use a custom PLV8 sproc, but not until we exhaust other possibilities with the Npgsql library.
  • I want some mechanism for queuing up queries and submitting them in one network round trip
  • The ability to make a named, reusable Linq query so you can reuse the underlying ADO.Net command generated from parsing the Linq expression without having to go through all the Expression parsing gymnastics on each usage
  • Really more for scalability than performance, but we’ll get around to asynchronous query methods. I’m just not judging that to be a critical path item right now.
  • It’s probably minor in the grand scheme of things, but the actual Linq expression to Sql query generation is grotesque in how it concatenates strings

Feel very free to make suggestions and other feedback on these items;-)

12 thoughts on “Optimizing for Performance in Marten

  1. It isn’t clear what indexes you already have on the table. You can significantly improve the speed of the “json locators” approach with 2 changes:

    1) add a jsonb_path_ops GIN index:

    CREATE INDEX idx_mt_doc_target_data_path ON mt_doc_target USING gin (data jsonb_path_ops);

    2) Use the containment operator:

    select d.data from mt_doc_target as d where d.data @> ‘{“Date”: :arg0}’;

    I’m actually not sure how well that variable binding will work when building the right-side JSON document. Try it with literal values if that doesnt work:

    select d.data from mt_doc_target as d where d.data @> ‘{“Date”:”‘2015-11-11″}’;

    I think you will find that the duplicate field is still the fastest, but it should be a much closer race with the jsonb locator approach.

    The containment operator also works well if you need to match on multiple values. For example, if you wanted to find documents with a specific date, author, and type. Imagine the author is stored as a JSON array in the document, since you support multiple authors. This will match, even if Jeremy is not the only author:

    select d.data from mt_doc_target as d where d.data @> ‘{“Date”:”‘2015-11-11″, “Type”: “blog”, “Author”: [“Jeremy”]}’;

    1. Hey Josh, it’s been awhile. I didn’t use squat for indexes because I really didn’t know where to start yet. I will certainly try out your suggestions and update the numbers next week.

      I wasn’t sure about whether or not the @> operator would be useful for searching on anything besides strings.

      Thanks for the tips,

      Jeremy

  2. Containment will work for non-string types. Although, I’m not sure about dates. JSON and dates have a funny relationship – as in, the relationship doesn’t exist. Your best best is to use a consistent string format of the date when inserting and querying.
    And if you haven’t done any indexes yet, the JSON locator approach (with the path_ops GIN index) will outperform the unindexed duplicated field by far. Once you add a btree index to the duplicate field, it should (slightly) win out. But the perf difference might not be enough to justify the duplicate fields.
    http://www.postgresql.org/docs/9.4/static/datatype-json.html#JSON-INDEXING

Leave a comment