Batch Queries with Marten

Marten v0.7 was published just under two weeks ago, and one of the shiny new features was the batched query model, which shipped with what I’ll charitably call a trial balloon syntax that was shot down pretty fast in the Marten Gitter room (I wasn’t happy with it either). To remedy that, we pushed a new Nuget (v0.7.1) this morning with a new, streamlined syntax for the batched query and updated the batched query docs to match.

So here’s the problem it tries to solve: say you have an HTTP endpoint that needs to aggregate several different sources of document data into a single JSON message for your web client (this is a common scenario in a large application at my work that will be converted to Marten shortly). To speed up that JSON endpoint, you’d like to be able to batch those queries into a single call to the underlying Postgresql database, but still have an easy way to get at the results of each query later. This is where Marten’s batch query functionality comes in, as demonstrated below:

// Start a new IBatchQuery from an active session
var batch = theSession.CreateBatchQuery();

// Fetch a single document by its Id
var user1 = batch.Load<User>("username");

// Fetch multiple documents by their id's
var admins = batch.LoadMany<User>().ById("user2", "user3");

// User-supplied sql
var toms = batch.Query<User>("where first_name = ?", "Tom");

// Query with Linq
var jills = batch.Query<User>().Where(x => x.FirstName == "Jill").ToList();

// Any() queries
var anyBills = batch.Query<User>().Any(x => x.FirstName == "Bill");

// Count() queries
var countJims = batch.Query<User>().Count(x => x.FirstName == "Jim");

// The Batch querying supports First/FirstOrDefault/Single/SingleOrDefault() selectors:
var firstInternal = batch.Query<User>().OrderBy(x => x.LastName).First(x => x.Internal);

// Kick off the batch query
await batch.Execute();

// All of the query mechanisms of the BatchQuery return
// Task's that are completed by the Execute() method above
var internalUser = await firstInternal;
Debug.WriteLine($"The first internal user is {internalUser.FirstName} {internalUser.LastName}");

Using the batch query is a four step process:

  1. Start a new batch query by calling IDocumentSession.CreateBatchQuery()
  2. Define the queries you want to execute by calling the Query() methods on the batch query object. Each query operator returns a Task<T> object that you’ll use later to access the results after the query has completed (under the covers it’s just a TaskCompletionSource).
  3. Execute the entire batch of queries and await the results
  4. Access the results of each query in the batch, either by using the await keyword or Task.Result.

 

A Note on our Syntax vis a vis RavenDb

You might note that the Marten syntax differs quite a bit, both syntactically and conceptually, from RavenDb’s Lazy Query feature. While we originally started Marten with the idea that we’d stay very close to RavenDb’s API to make the migration effort less difficult, we’re starting to deviate as we see fit. In this particular case, I wanted the API to be more explicit about the contents and lifecycle of the batched query. In other cases, like the forthcoming “Include Query” feature, we will probably stay very close to RavenDb’s syntax unless we have better ideas or a strong reason to deviate from the existing art.

 

A Note on “Living” Documentation

I’ve received a lot of criticism over the years for having inadequate, missing, or misleading documentation for the OSS projects I’ve run. Starting with Storyteller 3.0 and StructureMap 4.0 last year and now Marten this year, I’ve had some success using Storyteller’s static website generation to author technical documentation in a way that makes it easy to keep code samples and content up to date with changes to the underlying tool. In the case of the batched query syntax from Marten above, the code samples are pulled directly from the acceptance tests for the feature. As soon as I made the changes to the code, I was able to update the documentation online to reflect the new syntax by running a quick script and pushing to the gh-pages branch of the Marten repository. All told, it took me under a minute to refresh the content online.

New Features and Improvements in Marten 0.7

The Marten project was launched about 6 months ago as a proof of concept that we could really treat Postgresql as a document database, an event store, and a potential replacement for a problematic subsystem at work. Right now, Marten is starting to look like a potentially successful OSS project with an increasingly active and engaged community. If you’re interested in using Postgresql, document databases, or event sourcing in .Net, you may want to check out Marten’s website or jump into the discussions in the Marten Gitter room.

Marten development has been proceeding much faster over the past couple weeks as a lot of useful feedback and pull requests are flowing in from early adopters and I’m able to dedicate quite a bit of time at work to Marten in preparation for us converting some of our applications over. Only a couple weeks after a pretty sizable v0.6 release, I was just able to upload a new Marten v0.7 nuget as well as publish updated documentation for the new changes.

While you can see the entire list of changes from the GitHub issue list for this milestone, the big, flashy changes are:

  1. After several related requests, the database connection is now “sticky” to an IDocumentSession and the underlying database connection is exposed off of the interface. Among other things, this change allows users to integrate Dapper usage inside the same transaction boundaries as Marten. This change also allows you to specify the isolation level of the underlying transaction. See the documentation for a sample usage of this new feature (a rough sketch also follows this list).
  2. You can opt into storing a hierarchy of document types as a single database table and logical document collection. See the documentation topic for information on using this feature.
  3. Batched queries for potentially improved performance if you need to make several database requests at one time.
  4. The results of Linq queries are integrated with Marten’s Identity Map features
  5. Improved Linq query support for child collections
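
For that first item, here’s a rough sketch of what combining Dapper with a Marten session could look like. This is my illustration rather than code lifted from the Marten docs, and the exact member names (a LightweightSession() overload taking an isolation level, the Connection property on the session, and the mt_doc_user table name) are assumptions on my part, so check the documentation for the real syntax:

// A rough sketch only -- the Marten API members used here are assumptions.
// The point: Dapper runs against the same sticky connection (and transaction)
// that Marten is using, so everything commits or rolls back together.
using (var session = store.LightweightSession(IsolationLevel.Serializable))
{
    // Dapper's ExecuteScalar<T>() extension method against the exposed connection
    var count = session.Connection
        .ExecuteScalar<long>("select count(*) from mt_doc_user");

    session.Store(new User { FirstName = "Leia", LastName = "Organa" });
    session.SaveChanges();
}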

In addition to the big ticket items above, Marten improved the internals of its asynchronous query methods (thanks to Daniel Marbach), the robustness of its decision making on when and when not to regenerate tables, and the ability to use reserved Postgresql names as columns.

What’s next for Marten?

Right now the obvious consensus in the Marten community seems to be that we need to get serious with read side projection support, transformations, and some equivalent to RavenDb’s Include feature. Beyond that, I want to get some kind of instrumentation or logging story going and there’s a handful of “if only Marten had this *one* feature I could switch over” features in our issue list.

It’s not completely set yet, but the theoretical plans for the next v0.8 release are listed on GitHub.

If there’s any time soon, I’d like to restart some work on the event store half of Marten, but that has to remain a lower priority for me just based on what we think we need first at work.

Marten Takes a Big Step Forward with v0.6

EDIT: Nuget v0.6.1 is already up with some improvements to the async code in Marten. Hat tip to Daniel Marbach for his pull request on that one.

Marten is a new OSS project that seeks to turn Postgresql into a robust, usable document database (and an event store someday) for .Net development. There’s a recording of an internal talk I gave introducing Marten at work live on YouTube for more background.

Marten v0.6 just went live on nuget this afternoon. This turned into a pretty substantial release that I feel makes Marten much more robust, usable, and generally a lot closer to ready for production usage in bigger, more complicated systems.

This release came with substantial contributions from other developers and incorporates feedback from early adopters. I’d like to thank (in no particular order) Jens Pettersson, Corey Kaylor, Bojan Veljanovski, Jeff Doolittle, Phillip Haydon, and Evgeniy Kulakov for their contributions and feedback in this release.

What’s New:

You can see the complete set of changes from the v0.6 milestone on GitHub.

So, what’s next?

More than anything, I’m hoping to get more early adopters giving us feedback (and pull requests!) on what’s missing, what’s not easy to use, and where it needs to change. I think I’ll get the chance to try converting a large project from RavenDb to Marten soon that should help as well.

Feature-wise, I think the next couple of things up for a future v0.7 release would be:

  • Batched queries (futures)
  • Read-side projections, but whether that’s going to be via Javascript, .Net transforms, or both is yet to be determined
  • Using saved queries to avoid unnecessarily taking the hit of Linq expression parsing

“Introduction to Marten” Video

I gave an internal talk today at our Salt Lake City office on Marten that we were able to record and post publicly. I discussed why Postgresql, why or when to choose a document database over a relational database, what’s already done in Marten, and where it still needs to go.

And of course, if you just wanna know what Marten is, the website is here.

Any feedback is certainly welcome here or in the Marten Gitter room.

Today I learned that the only thing worse than doing a big, important talk on not enough sleep is doing two talks and a big meeting on technical strategy on the same day.

Marten is Ready for Early Adopters

I’ve been using RavenDb for development over the past several years and I’m firmly convinced that there’s a pretty significant productivity advantage to using document databases over relational databases for many systems. For as much as I love many of the concepts and usability of RavenDb, it isn’t running very successfully at work and it’s time to move our applications to something more robust. Fortunately, we’ve been able to dedicate some time toward using Postgresql as a document database. We’ve been able to do this work as a new OSS project called Marten. Our hope with Marten has been to retain the development time benefits of document databases (along with an easy migration path away from RavenDb) with a robust technological foundation — and even I’ll admit that it will occasionally be useful to fall back to using Postgresql as a relational database where that is still advantageous.

I feel like Marten is at a point where it’s usable and what we really need most is some early adopters who will kick the tires on it, give some feedback about how well it works, what’s missing that would make it easier to use, and how it’s performing in their systems. Fortunately, as of today, Marten now has (drum roll please):

And of course, the Marten Gitter room is always open for business.

An Example Quickstart

To get started with Marten, you need two things:

  1. A Postgresql database schema (either v9.4 or v9.5)
  2. The Marten nuget installed into your application

After that, the quickest way to get up and running is shown below with some sample usage:

var store = DocumentStore.For("your connection string");

Now you need a document type that will be persisted by Marten:

    public class User
    {
        public Guid Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public bool Internal { get; set; }
        public string UserName { get; set; }
    }

As long as a type can be serialized and deserialized by the JSON serializer of your choice and has a public field or property called “Id” or “id”, Marten can persist and load it back later.

To persist and load documents, you use the IDocumentSession interface:

    using (var session = store.LightweightSession())
    {
        var user = new User {FirstName = "Han", LastName = "Solo"};
        session.Store(user);

        session.SaveChanges();
    }
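
To read documents back out in a later session, you can load by Id or query with Linq. Here’s a minimal sketch reusing the same calls shown elsewhere in this post, where userId stands in for the Id value that Marten assigned to the stored document:

    using (var session = store.LightweightSession())
    {
        // Fetch a single document by its Id
        var user = session.Load<User>(userId);

        // Or search with the Linq support
        var solos = session.Query<User>()
            .Where(x => x.LastName == "Solo")
            .ToList();
    }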


Optimizing Marten Part 2

This is an update to an earlier blog post on optimizing for performance in Marten. Marten is a new OSS project I’m working on that allows .Net applications to treat the Postgresql database as a document database. Our hope at work is that Marten will be a more performant and easier to support replacement in our ecosystem for RavenDb (and possibly a replacement event store mechanism inside of the applications that use event sourcing, but that’s going to come later).

Before we can think about using Marten for real, we’re putting some effort into optimizing the performance of both reading and writing data from Postgresql.

Optimizing Queries with Indexes

In my previous post, my former colleague Joshua Flanagan suggested using the Postgresql containment operator and gin indexes as part of my performance comparisons. I’ve since added the ability to define database indexes for Marten document types like this:

public class ContainmentOperator : MartenRegistry
{
    public ContainmentOperator()
    {
        // For persisting a document type called 'Target'
        For<Target>()

            // Use a gin index against the json data field
            .GinIndexJsonData()

            // directs Marten to try to use the containment
            // operator for querying against this document type
            // in the Linq support
            .PropertySearching(PropertySearching.ContainmentOperator);
    }
}

and like this for indexing what we’re calling “searchable” fields, where Marten duplicates some element of a document into a separate database column for optimized searching:

public class DateIsSearchable : MartenRegistry
{
    public DateIsSearchable()
    {
        // This can also be done with attributes
        // This automatically adds a "BTree" index
        For<Target>().Searchable(x => x.Date);
    }
}

As of now, when you choose to make a field or property of a document “searchable”, Marten automatically adds a database index to that column on the document storage table. By default, the index is the standard Postgresql btree index, but you do have the ability to override how the index is created.

Now that we have support for querying using the containment operator and support for defining indexes, I reran the query performance tests and updated the results with some new data:

Serializer: JsonNetSerializer

Query Type                           1K      10K     100K
JSON Locator Only                    7       77.2    842.4
jsonb_to_record + lateral join       9.4     88.6    1170.4
searching by duplicated field        1       16.4    135.4
searching by containment operator    4.6     14.8    132.4

Serializer: JilSerializer

Query Type                           1K      10K     100K
JSON Locator Only                    6       54.8    827.8
jsonb_to_record + lateral join       8.6     76.2    1064.2
searching by duplicated field        1       6.8     64
searching by containment operator    4       7.8     66.8

Again, searching by a field that is duplicated as a simple database column with a btree index is clearly the fastest approach. The containment operator plus gin index comes in second, and may be the best choice when you will have to issue many different kinds of queries against the same document type. Based on this data, I think that we’re going to make the containment operator the preferred way of querying json documents, but fall back to using the json locator approach for all query operators other than equality tests.

I still think that we have to ship with Newtonsoft.Json as our default json serializer because of F# and polymorphism concerns among other things, but if you can get away with it for your document types, Jil is clearly much faster.

There is some conversation in the Marten Gitter room about possibly adding gin indexes to every document type by default, but I think we first need to pay attention to the data in the next section:

 

Insert Timings

The querying is definitely important, but we certainly want the write side of Marten to be fast too. We’ve had what we call “BulkInsert” support using Npgsql & Postgresql’s facility for bulk copying. Recently, I’ve changed Marten’s internal unit of work class to issue all of its delete and “upsert” commands in one single ADO.Net DbCommand to try to execute multiple sql statements in a single network round trip.

My best friend the Oracle database guru (I’ll know if he reads this because he’ll be groaning about the Oracle part;)) suggested that this approach might not matter much compared to issuing multiple ADO.Net commands against the same stateful transaction and connection, but we were both surprised by how much difference batching the SQL commands turned out to make.
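
To illustrate the general idea, here is a simplified sketch rather than Marten’s actual unit of work code (mt_upsert_target here just stands in for the kind of upsert sproc Marten generates per document type): batching amounts to concatenating one statement per document into a single command so the whole set of updates travels to the server in one round trip.

// A simplified sketch of command batching, not Marten's real internals:
// one upsert statement per document, all appended to a single CommandText.
public static void BatchedUpsert(NpgsqlConnection conn, IDictionary<Guid, string> documents)
{
    var command = conn.CreateCommand();
    var sql = new StringBuilder();
    var i = 0;

    foreach (var pair in documents)
    {
        // pair.Key is the document Id, pair.Value is its serialized JSON
        sql.AppendLine($"select mt_upsert_target(:id{i}, :doc{i});");

        command.Parameters.Add(new NpgsqlParameter($"id{i}", pair.Key));
        command.Parameters.Add(new NpgsqlParameter($"doc{i}", NpgsqlDbType.Jsonb) { Value = pair.Value });

        i++;
    }

    command.CommandText = sql.ToString();

    // Every statement executes in a single network round trip
    command.ExecuteNonQuery();
}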

To better understand the impact on insert timing using our bulk insert facility, the new batched update mechanism, and the original “ADO.Net command per document update” approach, I ran a series of tests that tried to insert 500 documents using each technique.

Because we also need to understand the implications on insertion and update timing of using the searchable, duplicated fields and gin indexes (there is some literature in the Postgresql docs stating that gin indexes could be expensive on the write side), I ran each permutation of update strategy against three different indexing strategies on the document storage table:

  1. No indexes whatsoever
  2. A duplicated field with a btree index
  3. Using a gin index against the JSON data column

And again, just for fun, I used both the Newtonsoft.Json and Jil serializers to also understand the impact that they have on performance.

You can find the code I used to make these tables in GitHub in the insert_timing class.

Using Newtonsoft.Json as the Serializer

Index                        Bulk Insert    Batch Update    Command per Document
No Index                     62             149             244
Duplicated Field w/ Index    53             152             254
Gin Index on Json            96             186             300

 

Using Jil as the Serializer

Index                        Bulk Insert    Batch Update    Command per Document
No Index                     47             134             224
Duplicated Field w/ Index    57             151             245
Gin Index on Json            79             180             270

As you can clearly see, the new batch update mechanism looks to be a pretty big win for performance over our original, naive “command per document” approach. The only downside is that this technique has a ceiling on how many documents, or how large they can be, before the single command exceeds technical limits. For right now, I think I’d like to simply beat that problem with documentation pushing users to use the bulk insert mechanism for large data sets. In the longer term, we’ll throttle the batch update by paging updates into some to-be-determined number of document updates at a time.

The key takeaway for me just reinforces the very first lesson I had drilled into me about software performance: network round trips are evil. We are certainly reducing the number of network round trips between our application and the database server by utilizing the command batching.

You can also see that using a gin index slows down the document updates considerably. I think the only good answer to users is that they’ll have to do performance testing as always.

 

Other Optimization Things

  • We’ve been able to cut down on Reflection hits and dynamic runtime behavior by using Roslyn as a crude metaprogramming mechanism to just codegen the document storage code.
  • Again in the theme of reducing network round trips, we’re going to investigate being able to batch up deferred queries into a single request to the Postgresql database.
  • We’re not sure about the details yet, but we’ll be investigating approaches for using asynchronous projections inside of Postgresql (maybe using Javascript running inside of the database, maybe .Net code in an external system, maybe both approaches).
  • I’m leaving the issues out in http://up-for-grabs.net, but we’ll definitely add the ability to just retrieve the raw JSON so that HTTP endpoints could stream data to clients without having to take the unnecessary hit of deserializing to a .Net type just to immediately serialize right back to JSON for the HTTP response. We’ll also support a completely asynchronous querying and update API for maximum scalability.

 

Using Roslyn for Runtime Code Generation in Marten

I’m using Roslyn to dynamically compile and load assemblies built at runtime from generated code in Marten, and other than some concern over the warmup time, it’s been going very well so far.

Like so many other developers with more cleverness than sense, I’ve spent a lot of time trying to build Hollywood Principle style frameworks that try to dynamically call application code at runtime through Reflection or some kind of related mechanism. Reflection itself has traditionally been the easiest mechanism to use in .Net to create dynamic behavior at runtime, but it can be a performance problem, especially if you use it naively.

A Look Back at What Came Before…

Taking my own StructureMap IoC tool as an example, over the years I’ve accomplished dynamic runtime behavior in a couple different ways:

  1. Emitting IL directly with Reflection.Emit, from the original versions through StructureMap 2.5. Working with IL is just barely a higher abstraction than assembly code, and I don’t recommend it if your goal is maintainability or making it easy for other developers to work in your code. I don’t miss generating IL by hand whatsoever. For those of you reading this and saying “pfft, IL isn’t so bad if you just understand how it works…”, my advice to you is to immediately go outside and get some fresh air and sunshine because you clearly aren’t thinking straight.
  2. From StructureMap 2.6 I crudely used the trick of building Expression trees representing what I needed to do, then compiling those Expression trees into objects of the right Func or Action signatures. This approach is easier – at least for me – because the Expression model is much closer semantically to the actual code you’re trying to mimic than the stack-based IL.
  3. From StructureMap 3.* on, there’s a much more complex dynamic Expression compilation model that’s robust enough to call constructor functions, setter properties, thread in interception, and surround all of that with try/catch logic for expressive exception messages and pseudo stack traces.

The current dynamic Expression approach in the StructureMap 3/4 internals is mostly working out well, but I barely remember how it works and it would take me a good day just to get back into that code if I ever had to change something.

What if instead we could just work directly in plain old C# that we largely know and understand, but somehow get that compiled at runtime instead? Well, thanks to Roslyn and its “compiler as a service”, we now can.

I’ve said before that I want to eventually replace the Expression compilation with the Roslyn code compilation shown in this post, but I’m not sure I’m ambitious enough to mess with a working project.

How Marten uses Roslyn Runtime Generation 

As I explained in my last blog post, Marten generates some “glue code” to connect a document object to the proper ADO.Net command objects for loading, storing, or deleting. For each document class, Marten generates an IDocumentStorage class with this signature:

public interface IDocumentStorage
{
    NpgsqlCommand UpsertCommand(object document, string json);
    NpgsqlCommand LoaderCommand(object id);
    NpgsqlCommand DeleteCommandForId(object id);
    NpgsqlCommand DeleteCommandForEntity(object entity);
    NpgsqlCommand LoadByArrayCommand<TKey>(TKey[] ids);
    Type DocumentType { get; }
}

In the test library, we have a class I creatively called “Target” that I’ve been using to test how Marten handles various .Net Types and queries. At runtime, Marten generates a class called TargetDocumentStorage that implements the interface above. Part of the generated code — modified by hand to clean up some extraneous line breaks and added comments — is shown below:

using Marten;
using Marten.Linq;
using Marten.Schema;
using Marten.Testing.Fixtures;
using Marten.Util;
using Npgsql;
using NpgsqlTypes;
using Remotion.Linq;
using System;
using System.Collections.Generic;

namespace Marten.GeneratedCode
{
    public class TargetStorage : IDocumentStorage, IBulkLoader<Target>, IdAssignment<Target>
    {
        public TargetStorage()
        {

        }

        public Type DocumentType => typeof (Target);

        public NpgsqlCommand UpsertCommand(object document, string json)
        {
            return UpsertCommand((Target)document, json);
        }

        public NpgsqlCommand LoaderCommand(object id)
        {
            return new NpgsqlCommand("select data from mt_doc_target where id = :id").WithParameter("id", id);
        }

        public NpgsqlCommand DeleteCommandForId(object id)
        {
            return new NpgsqlCommand("delete from mt_doc_target where id = :id").WithParameter("id", id);
        }

        public NpgsqlCommand DeleteCommandForEntity(object entity)
        {
            return DeleteCommandForId(((Target)entity).Id);
        }

        public NpgsqlCommand LoadByArrayCommand<T>(T[] ids)
        {
            return new NpgsqlCommand("select data from mt_doc_target where id = ANY(:ids)").WithParameter("ids", ids);
        }

        // I configured the "Date" field to be a duplicated/searchable field in code
        public NpgsqlCommand UpsertCommand(Target document, string json)
        {
            return new NpgsqlCommand("mt_upsert_target")
                .AsSproc()
                .WithParameter("id", document.Id)
                .WithJsonParameter("doc", json).WithParameter("arg_date", document.Date, NpgsqlDbType.Date);
        }

        // This Assign() method would use a HiLo sequence generator for numeric Id fields
        public void Assign(Target document)
        {
            if (document.Id == System.Guid.Empty) document.Id = System.Guid.NewGuid();
        }

        public void Load(ISerializer serializer, NpgsqlConnection conn, IEnumerable<Target> documents)
        {
            using (var writer = conn.BeginBinaryImport("COPY mt_doc_target(id, data, date) FROM STDIN BINARY"))
            {
                foreach (var x in documents)
                {
                    writer.StartRow();
                    writer.Write(x.Id, NpgsqlDbType.Uuid);
                    writer.Write(serializer.ToJson(x), NpgsqlDbType.Jsonb);
                    writer.Write(x.Date, NpgsqlDbType.Date);
                }
            }
        }
    }
}

Now that you can see what code I’m generating at runtime, let’s move on to a utility for generating the code.

SourceWriter

SourceWriter is a small utility class in Marten that helps you write neatly formatted, indented C# code. SourceWriter wraps a .Net StringWriter for efficient string manipulation and provides some helpers for adding namespace using statements and tracking indention levels for you. After experimenting with some different usages, I mostly settled on using the Write(text) method that allows you to provide a section of code as a multi-line string. The TargetDocumentStorage code I showed above is generated from within a class called DocumentStorageBuilder with a call to the SourceWriter.Write() method shown below:

            writer.Write(
                $@"
BLOCK:public class {mapping.DocumentType.Name}Storage : IDocumentStorage, IBulkLoader<{mapping.DocumentType.Name}>, IdAssignment<{mapping.DocumentType.Name}>

{fields}

BLOCK:public {mapping.DocumentType.Name}Storage({ctorArgs})
{ctorLines}
END

public Type DocumentType => typeof ({mapping.DocumentType.Name});

BLOCK:public NpgsqlCommand UpsertCommand(object document, string json)
return UpsertCommand(({mapping.DocumentType.Name})document, json);
END

BLOCK:public NpgsqlCommand LoaderCommand(object id)
return new NpgsqlCommand(`select data from {mapping.TableName} where id = :id`).WithParameter(`id`, id);
END

BLOCK:public NpgsqlCommand DeleteCommandForId(object id)
return new NpgsqlCommand(`delete from {mapping.TableName} where id = :id`).WithParameter(`id`, id);
END

BLOCK:public NpgsqlCommand DeleteCommandForEntity(object entity)
return DeleteCommandForId((({mapping.DocumentType.Name})entity).{mapping.IdMember.Name});
END

BLOCK:public NpgsqlCommand LoadByArrayCommand<T>(T[] ids)
return new NpgsqlCommand(`select data from {mapping.TableName} where id = ANY(:ids)`).WithParameter(`ids`, ids);
END


BLOCK:public NpgsqlCommand UpsertCommand({mapping.DocumentType.Name} document, string json)
return new NpgsqlCommand(`{mapping.UpsertName}`)
    .AsSproc()
    .WithParameter(`id`, document.{mapping.IdMember.Name})
    .WithJsonParameter(`doc`, json){extraUpsertArguments};
END

BLOCK:public void Assign({mapping.DocumentType.Name} document)
{mapping.IdStrategy.AssignmentBodyCode(mapping.IdMember)}
END

BLOCK:public void Load(ISerializer serializer, NpgsqlConnection conn, IEnumerable<{mapping.DocumentType.Name}> documents)
BLOCK:using (var writer = conn.BeginBinaryImport(`COPY {mapping.TableName}(id, data{duplicatedFieldsInBulkLoading}) FROM STDIN BINARY`))
BLOCK:foreach (var x in documents)
writer.StartRow();
writer.Write(x.Id, NpgsqlDbType.{id_NpgsqlDbType});
writer.Write(serializer.ToJson(x), NpgsqlDbType.Jsonb);
{duplicatedFieldsInBulkLoadingWriter}
END
END
END

END

");
        }

There are a couple of things to note about the code generation above:

  • String interpolation makes this so much easier than I think it would be with just string.Format(). Thank you to the C# 6 team.
  • Each line of code is written to the underlying StringWriter with the level of indention added to the left by SourceWriter itself
  • The “BLOCK” prefix directs SourceWriter to add an opening brace “{” to the next line, then increment the indention level
  • The “END” text directs SourceWriter to decrement the current indention level, then write a closing brace “}” to the next line and a blank line after that.
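
As a tiny, contrived illustration of those last two conventions (this is not code from Marten itself, and I’m assuming SourceWriter can simply be newed up), a template like this:

// Not Marten code -- just illustrating the BLOCK/END conventions
var writer = new SourceWriter();
writer.Write(@"
BLOCK:public class Greeter
BLOCK:public void Greet()
Console.WriteLine(42);
END
END
");

// SourceWriter would emit roughly:
//
// public class Greeter
// {
//     public void Greet()
//     {
//         Console.WriteLine(42);
//     }
//
// }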

Now that we’ve got ourselves some generated code, let’s get Roslyn involved to compile it and actually get at an object of the new Type we want.

Roslyn Compilation with AssemblyGenerator

Based on a blog post by Tugberk Ugurlu, I built the AssemblyGenerator class in Marten shown below that invokes Roslyn to compile C# code and load the new dynamically built Assembly into the application:

public class AssemblyGenerator
{
    private readonly IList<MetadataReference> _references = new List<MetadataReference>();

    public AssemblyGenerator()
    {
        ReferenceAssemblyContainingType<object>();
        ReferenceAssembly(typeof (Enumerable).Assembly);
    }

    public void ReferenceAssembly(Assembly assembly)
    {
        _references.Add(MetadataReference.CreateFromFile(assembly.Location));
    }

    public void ReferenceAssemblyContainingType<T>()
    {
        ReferenceAssembly(typeof (T).Assembly);
    }

    public Assembly Generate(string code)
    {
        var assemblyName = Path.GetRandomFileName();
        var syntaxTree = CSharpSyntaxTree.ParseText(code);

        var references = _references.ToArray();
        var compilation = CSharpCompilation.Create(assemblyName, new[] {syntaxTree}, references,
            new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));


        using (var stream = new MemoryStream())
        {
            var result = compilation.Emit(stream);

            if (!result.Success)
            {
                var failures = result.Diagnostics.Where(diagnostic =>
                    diagnostic.IsWarningAsError ||
                    diagnostic.Severity == DiagnosticSeverity.Error);


                var message = failures.Select(x => $"{x.Id}: {x.GetMessage()}").Join("\n");
                throw new InvalidOperationException("Compilation failures!\n\n" + message + "\n\nCode:\n\n" + code);
            }

            stream.Seek(0, SeekOrigin.Begin);
            return Assembly.Load(stream.ToArray());
        }
    }
}

At runtime, you use the AssemblyGenerator class by telling it which other assemblies it should reference and giving it the source code to compile:

// Generate the actual source code
var code = GenerateDocumentStorageCode(mappings);

var generator = new AssemblyGenerator();

// Tell the generator which other assemblies that it should be referencing 
// for the compilation
generator.ReferenceAssembly(Assembly.GetExecutingAssembly());
generator.ReferenceAssemblyContainingType<NpgsqlConnection>();
generator.ReferenceAssemblyContainingType<QueryModel>();
generator.ReferenceAssemblyContainingType<DbCommand>();
generator.ReferenceAssemblyContainingType<Component>();

mappings.Select(x => x.DocumentType.Assembly).Distinct().Each(assem => generator.ReferenceAssembly(assem));

// build the new assembly -- this will blow up if there are any
// compilation errors with the list of errors and the actual code
// as part of the exception message
var assembly = generator.Generate(code);

Finally, once you have the new Assembly, use Reflection just to find the new Type you want by either searching through Assembly.GetExportedTypes() or by name. Once you have the Type object, you can build that object through Activator.CreateInstance(Type) or any of the other normal Reflection mechanisms.
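
In Marten’s case, that last step looks roughly like this (a sketch of the Reflection mechanics rather than the exact Marten code):

// Find the generated storage type in the freshly built assembly by name...
var storageType = assembly.GetExportedTypes()
    .Single(x => x.Name == "TargetStorage");

// ...then build an instance of it with ordinary Reflection
var storage = (IDocumentStorage) Activator.CreateInstance(storageType);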

The Warmup Problem

So I’m very happy with using Roslyn in this way so far, but the initial “warmup” time on the very first usage of the compilation is noticeably slow. It’s a one time hit on startup, but this could get annoying when you’re trying to quickly iterate or debug a problem in code by frequently restarting the application. If the warmup problem really is serious in real applications, we may introduce a mode that just lets you export the generated code to file and have that code compiled with the rest of your project for much faster startup times.

Optimizing for Performance in Marten

For the last couple of weeks I’ve been working on a new project called Marten that is meant to exploit Postgresql’s JSONB data type as a full-fledged document database for .Net development and a drop-in replacement for RavenDb in our production environment. I would say that our primary goal with Marten is improved stability and supportability, but maximizing performance and throughput is a very close second in the priority list.

This is my second update on Marten progress. From last week, also see Marten Development So Far.

So far, I’ve mostly been focusing on optimizing the SQL queries generated by the Linq support for faster fetching. I’ve been experimenting with a few different query modes for the SQL generation based on what fields or properties you’re trying to search on:

  1. By default in the absence of any explicit configuration, Marten tries to use the “jsonb_to_record” function with a LATERAL join approach to optimize queries against members on the root of the document.
  2. You can also force Marten to only use basic Postgresql JSON locators to generate the where clauses in the SQL statements
  3. Finally, if you know that your application will be frequently querying a document type against a certain member, Marten can use a “searchable” field such that it duplicates that data in a normal database field and searches directly against that database field. This mechanism will clearly slow down your inserts and take up somewhat more storage space, but the numbers I’m about to display don’t lie: this is very clearly the fastest way to optimize queries using Marten (so far).

I’ve also experimented with both the Newtonsoft.Json serializer and the faster, but less flexible Jil serializer. Again, the numbers are pretty clear that for bigger result sets, Jil is much faster (NetJSON was a complete bust for me when I tried it). So far I’ve been able to keep Marten serializer-agnostic and I can easily see times when you’d have to opt for Newtonsoft’s flexibility.

Default jsonb_to_record/LATERAL JOIN

Using this approach, the SQL generated is:

select d.data from mt_doc_target as d, LATERAL jsonb_to_record(d.data) as l("Date" date) where l."Date" = :arg0

Json Locators Only

While you can configure this behavior on a field by field basis, the quickest way is to just set the default document behavior:

public class JsonLocatorOnly : MartenRegistry
{
    public JsonLocatorOnly()
    {
        // This can also be done with attributes
        For<Target>().PropertySearching(PropertySearching.JSON_Locator_Only);
    }
}

With this setting, the generated SQL is:

select d.data from mt_doc_target as d where CAST(d.data ->> 'Date' as date) = :arg0

Searchable, Duplicated Field

Again, to configure this option, I used this code:

public class DateIsSearchable : MartenRegistry
{
    public DateIsSearchable()
    {
        // This can also be done with attributes
        For<Target>().Searchable(x => x.Date);
    }
}

When I do this, the table for the Target type has an additional field called “date” that will get the value of the Target.Date property every time a Target object is inserted or updated in the database.

The resulting SQL is:

select d.data from mt_doc_target as d where d.date = :arg0

The Performance Results

I created the table below by generating randomized data, then trying to search by a DateTime field using three different mechanisms:

var theDate = DateTime.Today.AddDays(3);
var queryable = session.Query<Target>().Where(x => x.Date == theDate);

In all cases, I used the same sample data for the document count and took an average of running the same query five times after throwing out an initial attempt where Postgresql seemed to be “warming up” the JSONB data.
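
In other words, the measurement loop amounts to something like this simplified sketch (the real harness is in the performance_tuning file mentioned below):

// A simplified sketch of the measurement loop: throw away one warm-up run,
// then average the elapsed milliseconds over five executions of the query.
queryable.ToArray(); // initial attempt to let Postgresql "warm up" the JSONB data

var watch = new Stopwatch();
var timings = new List<long>();

for (var i = 0; i < 5; i++)
{
    watch.Restart();
    queryable.ToArray();
    watch.Stop();

    timings.Add(watch.ElapsedMilliseconds);
}

var average = timings.Average();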

Serializer: JsonNetSerializer

Query Type                       1K      10K     100K     1M
JSON Locator Only                9.6     75.2    691.2    9648
jsonb_to_record + lateral join   10      93.6    922.6    12091.2
searching by duplicated field    2.4     15      169.6    2777.8

Serializer: JilSerializer

Query Type                       1K      10K     100K     1M
JSON Locator Only                6.8     61      594.8    7265.6
jsonb_to_record + lateral join   8.4     86.6    784.2    9655.8
searching by duplicated field    1       8.8     115.4    2234.2

To be honest, I expected the JSONB_TO_RECORD + LATERAL JOIN mechanism to be faster than the JSON locator only approach, but I need to go back and try adding some indexes, because avoiding the object casts that inevitably defeat indexes is supposed to be the benefit of using JSONB_TO_RECORD. I’d be happy to have some Postgresql gurus weigh in here if any are reading this.

If you’re curious to see my mechanism for recording this data, see the performance_tuning code file in GitHub.

Bulk Loading Documents

From time to time (testing or data migrations maybe) you’ll have some need to very rapidly load a large set of documents into your database. I added a feature this morning to Marten that exploits Postgresql’s COPY feature supported by Npgsql:

public void load_with_small_batch()
{
    // This is just creating some randomized
    // document data
    var data = Target.GenerateRandomData(100).ToArray();

    // Load all of these into a Marten-ized database
    theSession.BulkLoad(data);

    // And just checking that the data is actually there;)
    theSession.Query<Target>().Count().ShouldBe(data.Length);
    theSession.Load<Target>(data[0].Id).ShouldNotBeNull();
}

Behind the scenes, Marten is using code generated at runtime and compiled by Roslyn to do the bulk loading as efficiently as possible without any hit from using reflection:

public void Load(ISerializer serializer, NpgsqlConnection conn, IEnumerable<Target> documents)
{
    using (var writer = conn.BeginBinaryImport("COPY mt_doc_target(id, data) FROM STDIN BINARY"))
    {
        foreach (var x in documents)
        {
            writer.StartRow();
            writer.Write(x.Id, NpgsqlDbType.Uuid);
            writer.Write(serializer.ToJson(x), NpgsqlDbType.Jsonb);
        }
    }
}

Do note that the code generation mechanism is smart enough to also add any fields or properties of the document type that are marked as duplicated for searching.

Other Outstanding Optimization Tasks 

  • Optimize the mechanics for applying all the changes in a unit of work. I’m hoping that we can do something to reduce the number of network round trips between the application and the postgresql server. My fallback approach is going to be to use a custom PLV8 sproc, but not until we exhaust other possibilities with the Npgsql library.
  • I want some mechanism for queuing up queries and submitting them in one network round trip
  • The ability to make a named, reusable Linq query so you can reuse the underlying ADO.Net command generated from parsing the Linq expression without having to go through all the Expression parsing gymnastics on each usage
  • Really more for scalability than performance, but we’ll get around to asynchronous query methods. I’m just not judging that to be a critical path item right now.
  • It’s probably minor in the grand scheme of things, but the actual Linq expression to Sql query generation is grotesque in how it concatenates strings

Feel very free to make suggestions and other feedback on these items;-)

Marten Development So Far (Postgresql as Doc Db)

Last week I mentioned that I had started a new OSS project called “Marten” that aims to allow .Net developers to treat Postgresql 9.5 (we’re using the new “upsert” functionality) as a document database using Postgresql’s JSONB data type. We’ve already had some interest and feedback on Github and in the Gitter room — plus links to at least three other ongoing efforts to do something similar with Postgresql, which I’m interpreting as obvious validation for the basic idea.

Please feel very free to chime in on the approach or requirements here, on Github, or in Gitter. We’re going to proceed with this project regardless at work, but I’d love to see it also be a viable community project with input from outside our little development organization.

What’s Already Done

I’d sum up the Marten work as “so far, so good”. If you look closely into the Marten code, do know that I have been purposely standing up the functionality with simple mechanics and naive implementations. My philosophy here is to get the functionality up with good test coverage before starting any heavy optimization work.

As of now:

  • Our thought is that the main service facade to Marten is the IDocumentSession interface that very closely mimics the same interface in RavenDb. This work is for my day job at Extend Health, and our immediate goal is to move systems off of RavenDb early next year, so I think that design decision is pretty understandable. That doesn’t mean that that’ll be the only way to interact with Marten in the long run.
  • In the “development mode”, Marten is able to create database tables and an “upsert” stored procedure for any new document type it encounters in calls to the IDocumentSession.
  • The real DocumentSession facade can store documents, load documents by either a single or array of id’s, and delete documents by the same.
  • DocumentSession implements a “unit of work” with similar usage to RavenDb’s.
  • You can completely bypass the Linq provider I’m describing in the next section and just use raw SQL to fetch documents (see the sketch just after this list)
  • A DocumentCleaner service that you can use to tear down document data or even the schema objects that Marten builds inside of automated testing harnesses
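
For that raw SQL escape hatch, here’s a sketch of what the usage can look like, using the User document type from the quickstart shown elsewhere on this page and a Postgresql JSON locator in the where clause (the exact clause depends on how the document is stored):

// Bypassing the Linq provider entirely with user-supplied SQL
var users = theSession.Query<User>("where data ->> 'FirstName' = ?", "Han");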

Linq Support

I don’t think I need to make the argument that Marten is going to be more usable and definitely more popular if it has decent Linq support. While I was afraid that building a Linq provider on top of the Postgresql JSON operators was going to be tedious and hard, the easy-to-use Relinq library has made it just “tedious.”

As early as next week I’m going to start working over the Linq support and the SQL it generates to try to optimize searching.

The Linq support hangs off of the IDocumentSession.Query<T>() method like so:

        public void query()
        {
            theSession.Store(new Target{Number = 1, DateOffset = DateTimeOffset.Now.AddMinutes(5)});
            theSession.Store(new Target{Number = 2, DateOffset = DateTimeOffset.Now.AddDays(1)});
            theSession.Store(new Target{Number = 3, DateOffset = DateTimeOffset.Now.AddHours(1)});
            theSession.Store(new Target{Number = 4, DateOffset = DateTimeOffset.Now.AddHours(-2)});
            theSession.Store(new Target{Number = 5, DateOffset = DateTimeOffset.Now.AddHours(-3)});

            theSession.SaveChanges();

            theSession.Query<Target>()
                .Where(x => x.DateOffset > DateTimeOffset.Now).ToArray()
                .Select(x => x.Number)
                .ShouldHaveTheSameElementsAs(1, 2, 3);
        }

For right now, the Linq IQueryable support includes:

  • IQueryable.Where() support with strings, int’s, long’s, decimal’s, DateTime’s, enumeration values, and boolean types.
  • Multiple or chained Where().Where().Where() clauses like you might use when you’re calculating optional where clauses or letting multiple pieces of code add additional filters
  • “&&” and “||” operators in the Where() clauses
  • Deep nested properties in the Where() clauses like x.Address.City == “Austin”
  • First(), FirstOrDefault(), Single(), and SingleOrDefault() support for the IQueryable
  • Count() and Any() support
  • Contains(), StartsWith(), and EndsWith() support for string values — but it’s case sensitive right now. Case-insensitive searches are probably going to be an “up-for-grabs” task;)
  • Take() and Skip() support for paging
  • OrderBy() / ThenBy() / OrderByDescending() support
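
Putting a few of those operators together against the same Target document type used in the example above, a representative query (a sketch, not a test lifted from the codebase) looks like:

// A representative sketch combining several of the supported operators
var results = theSession.Query<Target>()
    .Where(x => x.Number > 1 && x.Date >= DateTime.Today)
    .OrderBy(x => x.Date)
    .Skip(10)
    .Take(10)
    .ToArray();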

Right now, I’m using my audit of our largest system at work that uses RavenDb to guide and prioritize the Linq support. The only thing missing for us is searching within child collections of a document.

What we’re missing right now is:

  • Projections via IQueryable.Select(). Right now you have to do IQueryable.ToArray() to force the documents into memory before trying to use Select() projections (see the sketch after this list).
  • Last() and LastOrDefault()
  • A lot of things I probably hadn’t thought about at all;-)
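
To be explicit about that first gap, the interim workaround is to execute the query and then project in memory with Linq-to-Objects, as in this sketch:

// Workaround for the missing Select() support: run the query first,
// then project over the in-memory results
var numbers = theSession.Query<Target>()
    .Where(x => x.Number > 2)
    .ToArray()                 // executes against Postgresql
    .Select(x => x.Number)     // Linq-to-Objects projection in memory
    .ToArray();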

Using Roslyn for Runtime Code Compilation

We’ll see if this turns out to be a good idea or not, but as of today Marten is using Roslyn to generate strategy classes that “know” how to build database commands for updating, deleting, and loading document data for each document type instead of using Reflection, emitting IL, or compiling Expressions on the fly. Other than the “warm up” performance hit on doing the very first compilation, this is working smoothly so far. We’ll be watching it for performance. I’ll blog about that separately sometime soon-ish.

Next Week: Get Some Data and Optimize!

My focus for Marten development next week is on getting a non-trivial database together and working on pure optimization. My thought is to grab data from Github using Octokit.Net to build a semi-realistic document database of users, repositories, and commits from all my other OSS projects. After that, I’m going to try out:

  • Using GIN indexes against the jsonb data to see how that works
  • Trying to selectively duplicate data into normal database fields for lightweight sql searches and indexes
  • Trying to use Postgresql’s jsonb_to_record functionality inside of the Linq support to see if that makes searches faster
  • I’m using Newtonsoft.Json as the JSON serializer right now thinking that I’d want the extra flexibility later, but I want to try out Jil too for the comparison
  • After the SQL generation settles down, try to clean up the naive string concatenation going on inside of the Linq support
  • Optimize the batch updates through DocumentSession.SaveChanges(). Today it’s just making individual sql commands in one transaction. For some optimization, I’d like to at least try to make the updates happen in fewer remote calls to the database. My fallback plan is to use a *gasp* stored procedure using postgresql’s PLV8 javascript support to take any number of document updates or deletions as a single json payload.

That list above is enough to keep me busy next week, but there’s more in the open Github issue list and we’re all ears about whatever we’ve missed, so feel free to add more feature requests or comment on existing issues.

Why “Marten?”

One of my colleagues was sneering at the name I was using, so I googled for “natural predators of ravens” and the marten was one of the few options, so we ran with it.