Marten Gets Serious with v0.8

EDIT: DotNetRocks just published a new episode about Marten.

I just pushed a new v0.8 version of Marten to Nuget.org with a handful of new features, a lot of bugfixes, and a slew of refinements in response to feedback from our early adopters. I feel like this release added some polish to Marten, and I largely attribute that to how helpful our community has been in making suggestions and contributing pull requests based on their early usage.

The documentation site has been updated to reflect the new changes and additions to Marten for v0.8. The full list of changes and bug fixes for v0.8 is available in the GitHub issue history for this milestone.

Release Highlights

  1. Marten got a lot more sophisticated about how it updates schema objects in the underlying Postgresql database. See the changes to AutoCreateSchemaObjects and the ability to update a table for additional “searchable” columns instead of dropping and recreating tables in the documentation.
  2. “Bulk Deletes” by a query, without having to load the documents first
  3. The Linq parsing support got worked over pretty hard, including some new linq query support extensibility.
  4. New mechanisms for instrumentation and diagnostics within Marten. I’ll have a blog post next week about how we’re going to integrate with this at work for profiling database activity in our web and service bus apps.
  5. A lot of work to integrate the forthcoming event store functionality. I left this out of the documentation updates for now, but there’ll be a lot more about this later.

 

What’s Up Next?

There’s some preliminary planning for Marten v0.9 on GitHub. The very next feature I’m tackling is going to be our equivalent to RavenDb’s Include() feature (but I think we’re going to do it differently). Other than that, I think the theme of the next release is going to be addressing the “read side” of Marten applications with view projections by finally supporting “Select()” in the Linq support and some mix of Javascript or .Net projections.

A note on the versioning

While I am a big believer in semantic versioning (or at least I think it’s far better than having nothing), Marten is still pre-1.0 and a few public APIs did change. We haven’t discussed a timeline for the all-important “1.0” release, but I’m hopeful that it can happen by this summer (2016).

An Example of the Open/Closed Principle in Action

I saw someone on Twitter this month say that they’ve never really understood the Open/Closed Principle (OCP, the “O” in S.O.L.I.D.). I think it’s a very important concept in software architecture, but the terse statement maybe doesn’t make it too clear what it’s really about:

software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification

There are some other ways to interpret the Open/Closed Principle (the Wikipedia article about it talks about inheritance which I think is short sighted), but my restatement of the OCP would be:

“structure code such that additional functionality can be mostly written in all new code modules with no, or at least minimal, changes to existing code modules”

The key point is that it’s much less risky and usually easier to write brand new code — especially if the new code has minimal coupling to old code — than it is to modify existing code. Or to put it another way, can I continually add all new functionality to my existing system without causing a lot of regression bugs?

It’s not just risk either, it’s generally easier to understand complicated new code written on a blank canvas than it is to open up an existing code file and find the right places to insert your changes without breaking the old functionality.

 

Some examples of systems from my career that most definitely did not follow OCP might better illustrate why you’d care about OCP:

  1. Dynamic web page application that was effectively written in one single VB6 class. Every single addition or fix to the application meant editing that one single file, and very frequently broke existing functionality
  2. A large shipping application where every bit of routing logic for box positions within a factory floor was coded in a single, giant switch statement that shared a lot of global state. Again, changes to routing logic commonly broke existing functionality. The cost of regression testing this routing logic slowed down the team in charge of this system considerably.
  3. COBOL style batch processes coded in giant stored procedures with lots of global state
  4. Naive usages of Redux in Javascript could easily lead to the massive switch statement problem where all kinds of unrelated code changes involve the same central file

An OCP Example: Linq Provider Extensibility in Marten

We’ve been building (and building and building) Linq query support into Marten. Linq support is the type of problem I refer to as “Permutation Hell,” meaning that there’s an almost infinite supply of “what about querying by this type/operator/method call?” use cases. Recently, one of our early adopters asked for Linq support for querying by a range of values like this:

// Find all SuperUser documents where the role is "Admin", 
// "Supervisor", or "Director"
var users = theSession.Query<SuperUser>()
    .Where(x => x.Role.IsOneOf("Admin", "Supervisor", "Director"));

In the case above, IsOneOf() is a custom extension method in Marten that just means “the value of this property/field should be any of these values”.

I thought that was a great idea, but at the time the Linq provider code in Marten was effectively a “Spike-quality” blob of if/then branching logic. Extending the Linq support meant tracing through the largely procedural code to find the right spot to insert the new parsing logic. Recognizing this, I think, our early adopter also suggested making an extensibility point so that users and contributors could easily author and add new method parsing to the Linq provider.

What we really needed was a little bit of Open/Closed structuring so that additional method call parsing for things like IsOneOf() could be written in brand new code instead of trying to ram more branching logic into the older MartenExpressionParser class (the link is to an older version;)).

Looking through the old Linq parsing code, I realized there was an opportunity to abstract the responsibility for handling a call to a method in Linq queries behind this interface from Marten:

    /// <summary>
    /// Models the Sql generation for a method call
    /// in a Linq query. For example, map an expression like Where(x => x.Property.StartsWith("prefix"))
    /// to part of a Sql WHERE clause
    /// </summary>
    public interface IMethodCallParser
    {
        /// <summary>
        /// Can this parser create a Sql where clause
        /// from part of a Linq expression that calls
        /// a method
        /// </summary>
        /// <param name="expression"></param>
        /// <returns></returns>
        bool Matches(MethodCallExpression expression);

        /// <summary>
        /// Creates an IWhereFragment object that Marten
        /// uses to help construct the underlying Sql
        /// command
        /// </summary>
        /// <param name="mapping"></param>
        /// <param name="serializer"></param>
        /// <param name="expression"></param>
        /// <returns></returns>
        IWhereFragment Parse(
            IDocumentMapping mapping, 
            ISerializer serializer, 
            MethodCallExpression expression
            );
    }

The next step was to pull out strategy classes implementing this interface for the methods we already supported, like String.Contains(), String.StartsWith(), or String.EndsWith(). Inside of the Linq provider support, the step after that was to select the right strategy for a method expression and use it to help create the Sql string:

protected override Expression VisitMethodCall(MethodCallExpression expression)
{
    var parser = _parent._options.Linq.MethodCallParsers.FirstOrDefault(x => x.Matches(expression)) 
        ?? _parsers.FirstOrDefault(x => x.Matches(expression));

    if (parser != null)
    {
        var @where = parser.Parse(_mapping, _parent._serializer, expression);
        _register.Peek()(@where);

        return null;
    }


    throw new NotSupportedException($"Marten does not (yet) support Linq queries using the {expression.Method.DeclaringType.FullName}.{expression.Method.Name}() method");
}

Once that was in place, I could build out the IsOneOf() search functionality by building an all new class implementing that IMethodCallParser interface described above. To wire up the new strategy, it was a one line change to the existing Linq code:

        // The out of the box method call parsers
        private static readonly IList<IMethodCallParser> _parsers = new List<IMethodCallParser>
        {
            new StringContains(),
            new EnumerableContains(),
            new StringEndsWith(),
            new StringStartsWith(),

            // Added
            new IsOneOf()
        };

So yes, I did have to “open” up the existing code to make a small change to enable the new functionality, but at least it was a low impact change with minimal risk.
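To make the shape of that refactoring concrete, here is a minimal, self-contained sketch of the "first matching strategy wins" idea. It is not Marten's code: ISketchMethodParser, SketchSql, and SketchParsers are hypothetical names, and the parsers return a plain SQL-like string instead of Marten's IWhereFragment. It only handles literal string arguments and inlines them into the text, where real code would use command parameters.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical, simplified stand-in for IMethodCallParser
public interface ISketchMethodParser
{
    bool Matches(MethodCallExpression expression);
    string Parse(MethodCallExpression expression);
}

public static class SketchSql
{
    // Name of the member the method is called on, e.g. "Role" in x.Role.StartsWith("Ad")
    public static string Member(MethodCallExpression e) =>
        ((MemberExpression)e.Object).Member.Name;

    // First argument, assuming it was written as a string literal
    public static string Argument(MethodCallExpression e) =>
        (string)((ConstantExpression)e.Arguments[0]).Value;
}

public class StartsWithParser : ISketchMethodParser
{
    public bool Matches(MethodCallExpression e) =>
        e.Method.DeclaringType == typeof(string) && e.Method.Name == "StartsWith";

    public string Parse(MethodCallExpression e) =>
        $"{SketchSql.Member(e)} like '{SketchSql.Argument(e)}%'";
}

// "New" functionality lives in a brand new class...
public class EndsWithParser : ISketchMethodParser
{
    public bool Matches(MethodCallExpression e) =>
        e.Method.DeclaringType == typeof(string) && e.Method.Name == "EndsWith";

    public string Parse(MethodCallExpression e) =>
        $"{SketchSql.Member(e)} like '%{SketchSql.Argument(e)}'";
}

public static class SketchParsers
{
    public static readonly IList<ISketchMethodParser> All = new List<ISketchMethodParser>
    {
        new StartsWithParser(),
        new EndsWithParser() // ...and is wired up with one added line
    };

    // This method never changes when a new parser is added
    public static string Translate<T>(Expression<Func<T, bool>> predicate)
    {
        var call = predicate.Body as MethodCallExpression;
        if (call == null) throw new NotSupportedException("This sketch only handles method calls");

        var parser = All.FirstOrDefault(x => x.Matches(call));
        if (parser == null) throw new NotSupportedException($"No parser for {call.Method.Name}()");

        return parser.Parse(call);
    }
}

public class SketchUser
{
    public string Role { get; set; }
}

public static class Program
{
    public static void Main()
    {
        Console.WriteLine(SketchParsers.Translate<SketchUser>(x => x.Role.StartsWith("Ad"))); // Role like 'Ad%'
        Console.WriteLine(SketchParsers.Translate<SketchUser>(x => x.Role.EndsWith("or")));   // Role like '%or'
    }
}
```

The point of the sketch is that Translate() stays closed to modification while the list of parsers stays open for extension.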

I didn’t show it in this post, but there is also a new way to add your own implementations of IMethodCallParser to a Marten document store. I’m not entirely sure how many folks will take advantage of that extensibility point, but the structural refactoring I did to enable this story should make it much easier for us to continue to refine our Linq support.

This is yet another example of using plugin strategies to demonstrate the Open/Closed Principle, but I think the real emphasis should be on compositional designs. Even without formal plugin patterns, IoC containers, or configuration strategies, using the OCP to guide your thinking about how to minimize the risk of later changes is still valuable.

Batch Queries with Marten

Marten v0.7 was published just under two weeks ago, and one of the shiny new features was the batched query model. Its syntax was, let’s say, a trial balloon that was shot down pretty fast in the Marten Gitter room (I wasn’t happy with it either). To remedy that, we pushed a new Nuget this morning (v0.7.1) with a new, streamlined syntax for the batched query, and updated the batched query docs to match.

So here’s the problem it tries to solve. Say you have an HTTP endpoint that needs to aggregate several different sources of document data into a single JSON message back to your web client (this is a common scenario in a large application at my work that is going to be converted to Marten shortly). To speed up that JSON endpoint, you’d like to be able to batch up those queries into a single call to the underlying Postgresql database, but still have an easy way to get at the results of each query later. This is where Marten’s batch query functionality comes in, as demonstrated below:

// Start a new IBatchQuery from an active session
var batch = theSession.CreateBatchQuery();

// Fetch a single document by its Id
var user1 = batch.Load<User>("username");

// Fetch multiple documents by their id's
var admins = batch.LoadMany<User>().ById("user2", "user3");

// User-supplied sql
var toms = batch.Query<User>("where first_name = ?", "Tom");

// Query with Linq
var jills = batch.Query<User>().Where(x => x.FirstName == "Jill").ToList();

// Any() queries
var anyBills = batch.Query<User>().Any(x => x.FirstName == "Bill");

// Count() queries
var countJims = batch.Query<User>().Count(x => x.FirstName == "Jim");

// The Batch querying supports First/FirstOrDefault/Single/SingleOrDefault() selectors:
var firstInternal = batch.Query<User>().OrderBy(x => x.LastName).First(x => x.Internal);

// Kick off the batch query
await batch.Execute();

// All of the query mechanisms of the BatchQuery return
// Task's that are completed by the Execute() method above
var internalUser = await firstInternal;
Debug.WriteLine($"The first internal user is {internalUser.FirstName} {internalUser.LastName}");

Using the batch query is a four-step process:

  1. Start a new batch query by calling IDocumentSession.CreateBatchQuery()
  2. Define the queries you want to execute by calling the Query() methods on the batch query object. Each query operator returns a Task<T> object that you’ll use later to access the results after the query has completed (under the covers it’s just a TaskCompletionSource).
  3. Execute the entire batch of queries and await the results
  4. Access the results of each query in the batch, either by using the await keyword or Task.Result.
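The TaskCompletionSource remark in step 2 can be illustrated with a small, self-contained sketch. This is not Marten's implementation: SketchBatch and its Add() method are hypothetical, and a plain Func<T> stands in for reading one result set from the single database round trip.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class SketchBatch
{
    // One callback per registered query; each completes its own Task
    private readonly List<Action> _completions = new List<Action>();

    // Register a query and hand back a Task<T> that is not yet completed
    public Task<T> Add<T>(Func<T> fetch)
    {
        var source = new TaskCompletionSource<T>();

        _completions.Add(() =>
        {
            try
            {
                source.SetResult(fetch());
            }
            catch (Exception e)
            {
                // A failing query faults only its own Task
                source.SetException(e);
            }
        });

        return source.Task;
    }

    // Stand-in for the single round trip: completes every pending Task
    public Task Execute()
    {
        foreach (var complete in _completions)
        {
            complete();
        }

        _completions.Clear();
        return Task.CompletedTask;
    }
}

public static class Program
{
    public static async Task Main()
    {
        var batch = new SketchBatch();

        Task<int> count = batch.Add(() => 3);
        Task<string> name = batch.Add(() => "Jill");

        Console.WriteLine(count.IsCompleted); // False, nothing has executed yet

        await batch.Execute();

        Console.WriteLine($"{await name}: {await count}"); // Jill: 3
    }
}
```

Handing out the Task before the work happens is what lets each query keep its own strongly typed result while the execution itself is shared.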

 

A Note on our Syntax vis a vis RavenDb

You might note that the Marten syntax is quite different, both syntactically and conceptually, from RavenDb’s Lazy Query feature. While we originally started Marten with the idea that we’d stay very close to RavenDb’s API to make the migration effort less difficult, we’re starting to deviate as we see fit. In this particular case, I wanted the API to be more explicit about the contents and lifecycle of the batched query. In other cases like the forthcoming “Include Query” feature, we will probably stay very close to RavenDb’s syntax if we don’t have any better ideas or a strong reason to deviate from the existing art.

 

A Note on “Living” Documentation

I’ve received a lot of criticism over the years for having inadequate, missing, or misleading documentation for the OSS projects I’ve run. Starting with Storyteller 3.0 and StructureMap 4.0 last year and now Marten this year, I’ve been having some success using Storyteller’s static website generation to author technical documentation in a way that makes it easy to keep code samples and content up to date with changes to the underlying tool. In the case of the batched query syntax from Marten above, the code samples are pulled directly from the acceptance tests for the feature. As soon as I made the changes to the code, I was able to update the documentation online to reflect the new syntax by running a quick script and pushing to the gh-pages branch of the Marten repository. All told, it took me under a minute to refresh the content online.

New Features and Improvements in Marten 0.7

The Marten project was launched about 6 months ago as a proof of concept that we could really treat Postgresql as a document database, an event store, and a potential replacement for a problematic subsystem at work. Right now, Marten is starting to look like a potentially successful OSS project with an increasingly active and engaged community. If you’re interested in using Postgresql, Document Db, or event sourcing in .Net, you may want to check out Marten’s website or jump into the discussions in the Marten Gitter room.

Marten development has been proceeding much faster over the past couple weeks as a lot of useful feedback and pull requests are flowing in from early adopters and I’m able to dedicate quite a bit of time at work to Marten in preparation for us converting some of our applications over. Only a couple weeks after a pretty sizable v0.6 release, I was just able to upload a new Marten v0.7 nuget as well as publish updated documentation for the new changes.

While you can see the entire list of changes from the GitHub issue list for this milestone, the big, flashy changes are:

  1. After several related requests, the database connection is now “sticky” to an IDocumentSession and the underlying database connection is exposed off of the interface. Among other things, this change allows users to integrate Dapper usage inside the same transaction boundaries as Marten. This change also allows you to specify the isolation level of the underlying transaction. See the documentation for a sample usage of this new feature.
  2. You can opt into storing a hierarchy of document types as a single database table and logical document collection. See the documentation topic for information on using this feature.
  3. Batched queries for potentially improved performance if you need to make several database requests at one time.
  4. The results of Linq queries are integrated with Marten’s Identity Map features
  5. Improved Linq query support for child collections

In addition to the big ticket items above, Marten improved the internals of its asynchronous query methods (thanks to Daniel Marbach), the robustness of its decision making on when and when not to regenerate tables, and its ability to use reserved Postgresql names as columns.

What’s next for Marten?

Right now the obvious consensus in the Marten community seems to be that we need to get serious with read side projection support, transformations, and some equivalent to RavenDb’s Include feature. Beyond that, I want to get some kind of instrumentation or logging story going and there’s a handful of “if only Marten had this *one* feature I could switch over” features in our issue list.

It’s not completely set yet, but the theoretical plans for the next v0.8 release are listed on GitHub.

If there’s any time soon, I’d like to restart some work on the event store half of Marten, but that has to remain a lower priority for me just based on what we think we need first at work.

Marten Takes a Big Step Forward with v0.6

EDIT: Nuget v0.6.1 is already up with some improvements to the async code in Marten. Hat tip to Daniel Marbach for his pull request on that one.

Marten is a new OSS project that seeks to turn Postgresql into a robust, usable document database (and an event store someday) for .Net development. There’s a recording of an internal talk I gave introducing Marten at work live on YouTube for more background.

Marten v0.6 just went live on nuget this afternoon. This turned into a pretty substantial release that I feel makes Marten much more robust, usable, and generally a lot closer to ready for production usage in bigger, more complicated systems.

This release came with substantial contributions from other developers and incorporates feedback from early adopters. I’d like to thank (in no particular order) Jens Pettersson, Corey Kaylor, Bojan Veljanovski, Jeff Doolittle, Phillip Haydon, and Evgeniy Kulakov for their contributions and feedback in this release.

What’s New:

You can see the complete set of changes from the v0.6 milestone on GitHub.

So, what’s next?

More than anything, I’m hoping to get more early adopters giving us feedback (and pull requests!) on what’s missing, what’s not easy to use, and where it needs to change. I think I’ll get the chance to try converting a large project from RavenDb to Marten soon that should help as well.

Feature wise, I think the next couple things up for a future v0.7 release would be:

  • Batched queries (futures)
  • Readside projections, but whether that’s going to be via Javascript, .Net transforms, or both is yet to be determined
  • Using saved queries to avoid unnecessarily taking the hit of Linq expression parsing

“Introduction to Marten” Video

I gave an internal talk today at our Salt Lake City office on Marten that we were able to record and post publicly. I discussed why Postgresql, why or when to choose a document database over a relational database, what’s already done in Marten, and where it still needs to go.

And of course, if you just wanna know what Marten is, the website is here.

Any feedback is certainly welcome here or in the Marten Gitter room.

Today I learned that the only thing worse than doing a big, important talk on not enough sleep is doing two talks and a big meeting on technical strategy on the same day.

Marten is Ready for Early Adopters

I’ve been using RavenDb for development over the past several years and I’m firmly convinced that there’s a pretty significant productivity advantage to using document databases over relational databases for many systems. For as much as I love many of the concepts and usability of RavenDb, it isn’t running very successfully at work and it’s time to move our applications to something more robust. Fortunately, we’ve been able to dedicate some time toward using Postgresql as a document database. We’ve been able to do this work as a new OSS project called Marten. Our hope with Marten has been to retain the development time benefits of document databases (along with an easy migration path away from RavenDb) with a robust technological foundation — and even I’ll admit that it will occasionally be useful to fall back to using Postgresql as a relational database where that is still advantageous.

I feel like Marten is at a point where it’s usable and what we really need most is some early adopters who will kick the tires on it, give some feedback about how well it works, what’s missing that would make it easier to use, and how it’s performing in their systems. Fortunately, as of today, Marten now has (drum roll please):

And of course, the Marten Gitter room is always open for business.

An Example Quickstart

To get started with Marten, you need two things:

  1. A Postgresql database schema (either v9.4 or v9.5)
  2. The Marten nuget installed into your application

After that, the quickest way to get up and running is shown below with some sample usage:

var store = DocumentStore.For("your connection string");

Now you need a document type that will be persisted by Marten:

    public class User
    {
        public Guid Id { get; set; }
        public string FirstName { get; set; }
        public string LastName { get; set; }
        public bool Internal { get; set; }
        public string UserName { get; set; }
    }

As long as a type can be serialized and deserialized by the JSON serializer of your choice and has a public field or property called “Id” or “id”, Marten can persist and load it back later.
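As a rough illustration of the “Id” or “id” convention, the lookup can be sketched with plain reflection. This is an assumption about how such a convention could be checked, not a copy of Marten's internals; IdFinder and SketchDocument are hypothetical names.

```csharp
using System;
using System.Reflection;

public static class IdFinder
{
    // Returns the public instance property or field named "Id" or "id",
    // or null when the type does not follow the convention
    public static MemberInfo FindIdMember(Type documentType)
    {
        const BindingFlags flags = BindingFlags.Public | BindingFlags.Instance;

        foreach (var name in new[] { "Id", "id" })
        {
            MemberInfo member = documentType.GetProperty(name, flags);
            if (member == null)
            {
                member = documentType.GetField(name, flags);
            }

            if (member != null)
            {
                return member;
            }
        }

        return null;
    }
}

public class SketchDocument
{
    public Guid Id { get; set; }
    public string Name { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var member = IdFinder.FindIdMember(typeof(SketchDocument));
        Console.WriteLine(member == null ? "no id member" : member.Name); // Id

        // string has no public "Id" or "id" member
        Console.WriteLine(IdFinder.FindIdMember(typeof(string)) == null); // True
    }
}
```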

To persist and load documents, you use the IDocumentSession interface:

    using (var session = store.LightweightSession())
    {
        var user = new User {FirstName = "Han", LastName = "Solo"};
        session.Store(user);

        session.SaveChanges();
    }

Optimizing Marten Part 2

This is an update to an earlier blog post on optimizing for performance in Marten. Marten is a new OSS project I’m working on that allows .Net applications to treat the Postgresql database as a document database. Our hope at work is that Marten will be a more performant and easier to support replacement in our ecosystem for RavenDb (and possibly a replacement event store mechanism inside of the applications that use event sourcing, but that’s going to come later).

Before we think about using Marten for real, we’re undergoing some efforts to optimize the performance of both reading and writing data from Postgresql.

Optimizing Queries with Indexes

In my previous post, my former colleague Joshua Flanagan suggested using the Postgresql containment operator and gin indexes as part of my performance comparisons. After adding some ability to define database indexes for Marten document types like this:

public class ContainmentOperator : MartenRegistry
{
    public ContainmentOperator()
    {
        // For persisting a document type called 'Target'
        For<Target>()

            // Use a gin index against the json data field
            .GinIndexJsonData()

            // directs Marten to try to use the containment
            // operator for querying against this document type
            // in the Linq support
            .PropertySearching(PropertySearching.ContainmentOperator);
    }
}

and like this for indexing what we’re calling “searchable” fields where Marten duplicates some element of a document into a separate database column for optimized searching:

public class DateIsSearchable : MartenRegistry
{
    public DateIsSearchable()
    {
        // This can also be done with attributes
        // This automatically adds a "BTree" index
        For<Target>().Searchable(x => x.Date);
    }
}

As of now, when you choose to make a field or property of a document “searchable”, Marten is automatically adding a database index to that column on the document storage table. By default, the index is the standard Postgresql btree index, but you do have the ability to override how the index is created.

Now that we have support for querying using the containment operator and support for defining indexes, I reran the query performance tests and updated the results with some new data:

Serializer: JsonNetSerializer

Query Type                        | 1K  | 10K  | 100K
JSON Locator Only                 | 7   | 77.2 | 842.4
jsonb_to_record + lateral join    | 9.4 | 88.6 | 1170.4
searching by duplicated field     | 1   | 16.4 | 135.4
searching by containment operator | 4.6 | 14.8 | 132.4

Serializer: JilSerializer

Query Type                        | 1K  | 10K  | 100K
JSON Locator Only                 | 6   | 54.8 | 827.8
jsonb_to_record + lateral join    | 8.6 | 76.2 | 1064.2
searching by duplicated field     | 1   | 6.8  | 64
searching by containment operator | 4   | 7.8  | 66.8

Again, searching by a field that is duplicated as a simple database column with a btree index is clearly the fastest approach. The containment operator plus gin index comes in second, and may be the best choice when you will have to issue many different kinds of queries against the same document type. Based on this data, I think that we’re going to make the containment operator the preferred way of querying json documents, but fall back to using the json locator approach for all other query operators besides equality tests.

I still think that we have to ship with Newtonsoft.Json as our default json serializer because of F# and polymorphism concerns among other things, but if you can get away with it for your document types, Jil is clearly much faster.

There is some conversation in the Marten Gitter room about possibly adding gin indexes to every document type by default, but I think we first need to pay attention to the data in the next section:

 

Insert Timings

The querying is definitely important, but we certainly want the write side of Marten to be fast too. We’ve had what we call “BulkInsert” support using Npgsql & Postgresql’s facility for bulk copying. Recently, I’ve changed Marten’s internal unit of work class to issue all of its delete and “upsert” commands in one single ADO.Net DbCommand to try to execute multiple sql statements in a single network round trip.

My best friend the Oracle database guru (I’ll know if he reads this because he’ll be groaning about the Oracle part;)) suggested that this approach might not matter compared with issuing multiple ADO.Net commands against the same stateful transaction and connection, but we were both surprised by how much difference batching the SQL commands made.

To better understand the impact on insert timing using our bulk insert facility, the new batched update mechanism, and the original “ADO.Net command per document update” approach, I ran a series of tests that tried to insert 500 documents using each technique.

Because we also need to understand the implications on insertion and update timing of using the searchable, duplicated fields and gin indexes (there is some literature in the Postgresql docs stating that gin indexes could be expensive on the write side), I ran each permutation of update strategy against three different indexing strategies on the document storage table:

  1. No indexes whatsoever
  2. A duplicated field with a btree index
  3. Using a gin index against the JSON data column

And again, just for fun, I used both the Newtonsoft.Json and Jil serializers to also understand the impact that they have on performance.

You can find the code I used to make these tables in GitHub in the insert_timing class.

Using Newtonsoft.Json as the Serializer

Index                     | Bulk Insert | Batch Update | Command per Document
No Index                  | 62          | 149          | 244
Duplicated Field w/ Index | 53          | 152          | 254
Gin Index on Json         | 96          | 186          | 300

 

Using Jil as the Serializer

Index                     | Bulk Insert | Batch Update | Command per Document
No Index                  | 47          | 134          | 224
Duplicated Field w/ Index | 57          | 151          | 245
Gin Index on Json         | 79          | 180          | 270

As you can clearly see, the new batch update mechanism looks to be a pretty big win for performance over our original, naive “command per document” approach. The only downside is that this technique has a ceiling on how many or how large the documents can be before the single command exceeds technical limits. For right now, I think I’d like to simply beat that problem with documentation pushing users toward the bulk insert mechanism for large data sets. In the longer term, we’ll throttle the batch update by paging updates into some to-be-determined number of document updates at a time.
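The paging idea could look something like the sketch below. BatchPaging is a hypothetical helper and the page size of 200 is an arbitrary assumption for illustration; nothing here reflects a decision Marten has made.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BatchPaging
{
    // Splits a sequence of pending updates into pages of at most pageSize items,
    // so each page can be sent as one batched command
    public static IEnumerable<IReadOnlyList<T>> Page<T>(IEnumerable<T> items, int pageSize)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException(nameof(pageSize));

        var page = new List<T>(pageSize);
        foreach (var item in items)
        {
            page.Add(item);
            if (page.Count == pageSize)
            {
                yield return page;
                page = new List<T>(pageSize);
            }
        }

        // Emit the final, partially filled page
        if (page.Count > 0) yield return page;
    }
}

public static class Program
{
    public static void Main()
    {
        // 500 pending document updates, represented here by their numbers
        var updates = Enumerable.Range(1, 500);

        foreach (var page in BatchPaging.Page(updates, 200))
        {
            // In a real unit of work, each page would become one command and one round trip
            Console.WriteLine($"Batch of {page.Count} updates");
        }
    }
}
```

With 500 updates and a page size of 200, that is three round trips (200, 200, and 100 updates) instead of 500 with the command-per-document approach.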

The key takeaway for me just reinforces the very first lesson I had drilled into me about software performance: network round trips are evil. We are certainly reducing the number of network round trips between our application and the database server by utilizing the command batching.

You can also see that using a gin index slows down the document updates considerably. I think the only good answer to users is that they’ll have to do performance testing as always.

 

Other Optimization Things

  • We’ve been able to cut down on Reflection hits and dynamic runtime behavior by using Roslyn as a crude metaprogramming mechanism to just codegen the document storage code.
  • Again in the theme of reducing network round trips, we’re going to investigate being able to batch up deferred queries into a single request to the Postgresql database.
  • We’re not sure about the details yet, but we’ll be investigating approaches for using asynchronous projections inside of Postgresql (maybe using Javascript running inside of the database, maybe .Net code in an external system, maybe both approaches).
  • I’m leaving the issues out in http://up-for-grabs.net, but we’ll definitely add the ability to just retrieve the raw JSON so that HTTP endpoints could stream data to clients without having to take the unnecessary hit of deserializing to a .Net type just to immediately serialize right back to JSON for the HTTP response. We’ll also support a completely asynchronous querying and update API for maximum scalability.

 

Using Roslyn for Runtime Code Generation in Marten

I’m using Roslyn to dynamically compile and load assemblies built at runtime from generated code in Marten and other than some concern over the warmup time, it’s been going very well so far.

Like so many other developers with more cleverness than sense, I’ve spent a lot of time trying to build Hollywood Principle style frameworks that try to dynamically call application code at runtime through Reflection or some kind of related mechanism. Reflection itself has traditionally been the easiest mechanism to use in .Net to create dynamic behavior at runtime, but it can be a performance problem, especially if you use it naively.

A Look Back at What Came Before…

Taking my own StructureMap IoC tool as an example, over the years I’ve accomplished dynamic runtime behavior in a couple different ways:

  1. Using IL directly using Reflection.Emit from the original versions through StructureMap 2.5. Working with IL is just barely a higher abstraction than assembly code and I don’t recommend using that if your goal is maintainability or making it easy for other developers to work in your code. I don’t miss generating IL by hand whatsoever. For those of you reading this and saying “pfft, IL isn’t so bad if you just understand how it works…”, my advice to you is to immediately go outside and get some fresh air and sunshine because you clearly aren’t thinking straight.
  2. From StructureMap 2.6 I crudely used the trick of building Expression trees representing what I needed to do, then compiling those Expression trees into objects of the right Func or Action signatures. This approach is easier – at least for me – because the Expression model is much closer semantically to the actual code you’re trying to mimic than the stack-based IL.
  3. From StructureMap 3.* on, there’s a much more complex dynamic Expression compilation model that’s robust enough to call constructor functions, setter properties, thread in interception, and surround all of that with try/catch logic for expressive exception messages and pseudo stack traces.

The current dynamic Expression approach in the StructureMap 3/4 internals is mostly working out well, but I barely remember how it works and it would take me a good day just to get back into that code if I ever had to change something.

What if instead we could just work directly in plain old C# that we largely know and understand, but somehow get that compiled at runtime instead? Well, thanks to Roslyn and its “compiler as a service”, we now can.

I’ve said before that I want to eventually replace the Expression compilation with the Roslyn code compilation shown in this post, but I’m not sure I’m ambitious enough to mess with a working project.

How Marten uses Roslyn Runtime Generation 

As I explained in my last blog post, Marten generates some “glue code” to connect a document object to the proper ADO.Net command objects for loading, storing, or deleting. For each document class, Marten generates an IDocumentStorage class with this signature:

public interface IDocumentStorage
{
    NpgsqlCommand UpsertCommand(object document, string json);
    NpgsqlCommand LoaderCommand(object id);
    NpgsqlCommand DeleteCommandForId(object id);
    NpgsqlCommand DeleteCommandForEntity(object entity);
    NpgsqlCommand LoadByArrayCommand<TKey>(TKey[] ids);
    Type DocumentType { get; }
}

In the test library, we have a class I creatively called “Target” that I’ve been using to test how Marten handles various .Net Types and queries. At runtime, Marten generates a class called TargetStorage that implements the interface above. Part of the generated code (modified by hand to clean up some extraneous line breaks and add comments) is shown below:

using Marten;
using Marten.Linq;
using Marten.Schema;
using Marten.Testing.Fixtures;
using Marten.Util;
using Npgsql;
using NpgsqlTypes;
using Remotion.Linq;
using System;
using System.Collections.Generic;

namespace Marten.GeneratedCode
{
    public class TargetStorage : IDocumentStorage, IBulkLoader<Target>, IdAssignment<Target>
    {
        public TargetStorage()
        {

        }

        public Type DocumentType => typeof (Target);

        public NpgsqlCommand UpsertCommand(object document, string json)
        {
            return UpsertCommand((Target)document, json);
        }

        public NpgsqlCommand LoaderCommand(object id)
        {
            return new NpgsqlCommand("select data from mt_doc_target where id = :id").WithParameter("id", id);
        }

        public NpgsqlCommand DeleteCommandForId(object id)
        {
            return new NpgsqlCommand("delete from mt_doc_target where id = :id").WithParameter("id", id);
        }

        public NpgsqlCommand DeleteCommandForEntity(object entity)
        {
            return DeleteCommandForId(((Target)entity).Id);
        }

        public NpgsqlCommand LoadByArrayCommand<T>(T[] ids)
        {
            return new NpgsqlCommand("select data from mt_doc_target where id = ANY(:ids)").WithParameter("ids", ids);
        }

        // I configured the "Date" field to be a duplicated/searchable field in code
        public NpgsqlCommand UpsertCommand(Target document, string json)
        {
            return new NpgsqlCommand("mt_upsert_target")
                .AsSproc()
                .WithParameter("id", document.Id)
                .WithJsonParameter("doc", json).WithParameter("arg_date", document.Date, NpgsqlDbType.Date);
        }

        // This Assign() method would use a HiLo sequence generator for numeric Id fields
        public void Assign(Target document)
        {
            if (document.Id == System.Guid.Empty) document.Id = System.Guid.NewGuid();
        }

        public void Load(ISerializer serializer, NpgsqlConnection conn, IEnumerable<Target> documents)
        {
            using (var writer = conn.BeginBinaryImport("COPY mt_doc_target(id, data, date) FROM STDIN BINARY"))
            {
                foreach (var x in documents)
                {
                    writer.StartRow();
                    writer.Write(x.Id, NpgsqlDbType.Uuid);
                    writer.Write(serializer.ToJson(x), NpgsqlDbType.Jsonb);
                    writer.Write(x.Date, NpgsqlDbType.Date);
                }
            }
        }
    }
}

Now that you can see what code I’m generating at runtime, let’s move on to a utility for generating the code.

SourceWriter

SourceWriter is a small utility class in Marten that helps you write neatly formatted, indented C# code. SourceWriter wraps a .Net StringWriter for efficient string manipulation and provides some helpers for adding namespace using statements and tracking indention levels for you. After experimenting with some different usages, I mostly settled on using the Write(text) method that allows you to provide a section of code as a multi-line string. The TargetStorage code I showed above is generated from within a class called DocumentStorageBuilder with a call to the SourceWriter.Write() method shown below:

            writer.Write(
                $@"
BLOCK:public class {mapping.DocumentType.Name}Storage : IDocumentStorage, IBulkLoader<{mapping.DocumentType.Name}>, IdAssignment<{mapping.DocumentType.Name}>

{fields}

BLOCK:public {mapping.DocumentType.Name}Storage({ctorArgs})
{ctorLines}
END

public Type DocumentType => typeof ({mapping.DocumentType.Name});

BLOCK:public NpgsqlCommand UpsertCommand(object document, string json)
return UpsertCommand(({mapping.DocumentType.Name})document, json);
END

BLOCK:public NpgsqlCommand LoaderCommand(object id)
return new NpgsqlCommand(`select data from {mapping.TableName} where id = :id`).WithParameter(`id`, id);
END

BLOCK:public NpgsqlCommand DeleteCommandForId(object id)
return new NpgsqlCommand(`delete from {mapping.TableName} where id = :id`).WithParameter(`id`, id);
END

BLOCK:public NpgsqlCommand DeleteCommandForEntity(object entity)
return DeleteCommandForId((({mapping.DocumentType.Name})entity).{mapping.IdMember.Name});
END

BLOCK:public NpgsqlCommand LoadByArrayCommand<T>(T[] ids)
return new NpgsqlCommand(`select data from {mapping.TableName} where id = ANY(:ids)`).WithParameter(`ids`, ids);
END


BLOCK:public NpgsqlCommand UpsertCommand({mapping.DocumentType.Name} document, string json)
return new NpgsqlCommand(`{mapping.UpsertName}`)
    .AsSproc()
    .WithParameter(`id`, document.{mapping.IdMember.Name})
    .WithJsonParameter(`doc`, json){extraUpsertArguments};
END

BLOCK:public void Assign({mapping.DocumentType.Name} document)
{mapping.IdStrategy.AssignmentBodyCode(mapping.IdMember)}
END

BLOCK:public void Load(ISerializer serializer, NpgsqlConnection conn, IEnumerable<{mapping.DocumentType.Name}> documents)
BLOCK:using (var writer = conn.BeginBinaryImport(`COPY {mapping.TableName}(id, data{duplicatedFieldsInBulkLoading}) FROM STDIN BINARY`))
BLOCK:foreach (var x in documents)
writer.StartRow();
writer.Write(x.Id, NpgsqlDbType.{id_NpgsqlDbType});
writer.Write(serializer.ToJson(x), NpgsqlDbType.Jsonb);
{duplicatedFieldsInBulkLoadingWriter}
END
END
END

END

");
        }

There’s a couple things to note about the code generation above:

  • String interpolation makes this so much easier than I think it would be with just string.Format(). Thank you to the C# 6 team.
  • Each line of code is written to the underlying StringWriter with the level of indention added to the left by SourceWriter itself
  • The “BLOCK” prefix directs SourceWriter to add an opening brace “{” to the next line, then increment the indention level
  • The “END” text directs SourceWriter to decrement the current indention level, then write a closing brace “}” to the next line and a blank line after that.
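
To make the BLOCK/END rules above concrete, here is a minimal sketch of a writer that follows them. It is not Marten's actual SourceWriter; the MiniSourceWriter name, the four-space indent, and the backtick-to-double-quote replacement (inferred from the backticks in the template above) are all assumptions for illustration.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for SourceWriter, for illustration only
public class MiniSourceWriter
{
    private readonly StringWriter _writer = new StringWriter();
    private int _level;

    // Accepts a multi-line template and processes it one line at a time
    public void Write(string text)
    {
        // Assumption: backticks in the template stand in for double quotes
        text = text.Replace('`', '"');

        using (var reader = new StringReader(text.Trim()))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                WriteLine(line.Trim());
            }
        }
    }

    private void WriteLine(string line)
    {
        if (line.Length == 0)
        {
            _writer.WriteLine();
        }
        else if (line.StartsWith("BLOCK:", StringComparison.Ordinal))
        {
            // Write the declaration, an opening brace, then indent
            Indented(line.Substring("BLOCK:".Length));
            Indented("{");
            _level++;
        }
        else if (line == "END")
        {
            // Outdent, write the closing brace, then a blank line
            _level--;
            Indented("}");
            _writer.WriteLine();
        }
        else
        {
            Indented(line);
        }
    }

    // Prefixes the text with the current indention (four spaces per level)
    private void Indented(string text)
    {
        _writer.WriteLine(new string(' ', _level * 4) + text);
    }

    public string Code() => _writer.ToString();
}
```

Because the indention is tracked by the writer, the template itself can stay flush left, which is what keeps the interpolated string above readable.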

Now that we’ve got ourselves some generated code, let’s get Roslyn involved to compile it and actually get at an object of the new Type we want.

Roslyn Compilation with AssemblyGenerator

Based on a blog post by Tugberk Ugurlu, I built the AssemblyGenerator class in Marten shown below that invokes Roslyn to compile C# code and load the new dynamically built Assembly into the application:

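using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Reflection;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

public class AssemblyGenerator
{
    private readonly IList<MetadataReference> _references = new List<MetadataReference>();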

    public AssemblyGenerator()
    {
        ReferenceAssemblyContainingType<object>();
        ReferenceAssembly(typeof (Enumerable).Assembly);
    }

    public void ReferenceAssembly(Assembly assembly)
    {
        _references.Add(MetadataReference.CreateFromFile(assembly.Location));
    }

    public void ReferenceAssemblyContainingType<T>()
    {
        ReferenceAssembly(typeof (T).Assembly);
    }

    public Assembly Generate(string code)
    {
        var assemblyName = Path.GetRandomFileName();
        var syntaxTree = CSharpSyntaxTree.ParseText(code);

        var references = _references.ToArray();
        var compilation = CSharpCompilation.Create(assemblyName, new[] {syntaxTree}, references,
            new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));


        using (var stream = new MemoryStream())
        {
            var result = compilation.Emit(stream);

            if (!result.Success)
            {
                var failures = result.Diagnostics.Where(diagnostic =>
                    diagnostic.IsWarningAsError ||
                    diagnostic.Severity == DiagnosticSeverity.Error);


                // Join() here is a string-joining extension method from a helper library, not part of the BCL
                var message = failures.Select(x => $"{x.Id}: {x.GetMessage()}").Join("\n");
                throw new InvalidOperationException("Compilation failures!\n\n" + message + "\n\nCode:\n\n" + code);
            }

            stream.Seek(0, SeekOrigin.Begin);
            return Assembly.Load(stream.ToArray());
        }
    }
}

At runtime, you use the AssemblyGenerator class by telling it which other assemblies it should reference and giving it the source code to compile:

// Generate the actual source code
var code = GenerateDocumentStorageCode(mappings);

var generator = new AssemblyGenerator();

// Tell the generator which other assemblies that it should be referencing 
// for the compilation
generator.ReferenceAssembly(Assembly.GetExecutingAssembly());
generator.ReferenceAssemblyContainingType<NpgsqlConnection>();
generator.ReferenceAssemblyContainingType<QueryModel>();
generator.ReferenceAssemblyContainingType<DbCommand>();
generator.ReferenceAssemblyContainingType<Component>();

mappings.Select(x => x.DocumentType.Assembly).Distinct().Each(assem => generator.ReferenceAssembly(assem));

// build the new assembly -- this will blow up if there are any
// compilation errors with the list of errors and the actual code
// as part of the exception message
var assembly = generator.Generate(code);

Finally, once you have the new Assembly, use Reflection just to find the new Type you want by either searching through Assembly.GetExportedTypes() or by name. Once you have the Type object, you can build that object through Activator.CreateInstance(Type) or any of the other normal Reflection mechanisms.
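
That last step might look something like the sketch below. The GeneratedTypeLoader class and CreateByName method are hypothetical names for this example rather than anything in Marten, and the sketch assumes the generated type has a public parameterless constructor.

```csharp
using System;
using System.Linq;
using System.Reflection;

// Hypothetical helper for illustration only, not part of Marten
public static class GeneratedTypeLoader
{
    // Finds an exported type by its simple name, checks that it can be
    // cast to T, and builds it through its parameterless constructor
    public static T CreateByName<T>(Assembly assembly, string typeName)
    {
        var type = assembly
            .GetExportedTypes()
            .FirstOrDefault(t => t.Name == typeName && typeof(T).IsAssignableFrom(t));

        if (type == null)
        {
            throw new InvalidOperationException(
                $"No exported type named {typeName} that implements {typeof(T).Name} was found");
        }

        return (T)Activator.CreateInstance(type);
    }
}
```

With the assembly from the previous snippet, the call would be along the lines of GeneratedTypeLoader.CreateByName<IDocumentStorage>(assembly, "TargetStorage"). Reflection is only used once here to find and build the object; every call after that goes through the interface.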

The Warmup Problem

So I’m very happy with using Roslyn in this way so far, but the initial “warmup” time on the very first usage of the compilation is noticeably slow. It’s a one time hit on startup, but this could get annoying when you’re trying to quickly iterate or debug a problem in code by frequently restarting the application. If the warmup problem really is serious in real applications, we may introduce a mode that just lets you export the generated code to file and have that code compiled with the rest of your project for much faster startup times.

Optimizing for Performance in Marten

For the last couple of weeks I’ve been working on a new project called Marten that is meant to exploit Postgresql’s JSONB data type as a full-fledged document database for .Net development, and as a drop-in replacement for RavenDb in our production environment. Our primary goal with Marten is improved stability and supportability, but maximizing performance and throughput is a very close second on the priority list.

This is my second update on Marten progress. From last week, also see Marten Development So Far.

So far, I’ve mostly been focusing on optimizing the SQL queries generated by the Linq support for faster fetching. I’ve been experimenting with a few different query modes for the SQL generation based on what fields or properties you’re trying to search on:

  1. By default in the absence of any explicit configuration, Marten tries to use the “jsonb_to_record” function with a LATERAL join approach to optimize queries against members on the root of the document.
  2. You can also force Marten to only use basic Postgresql JSON locators to generate the where clauses in the SQL statements
  3. Finally, if you know that your application will be frequently querying a document type against a certain member, Marten can use a “searchable” field such that it duplicates that data in a normal database field and searches directly against that database field. This mechanism will clearly slow down your inserts and take up somewhat more storage space, but the numbers I’m about to display don’t lie: this is very clearly the fastest way to optimize queries using Marten (so far).

I’ve also experimented with both the Newtonsoft.Json serializer and the faster, but less flexible Jil serializer. Again, the numbers are pretty clear that for bigger result sets, Jil is much faster (NetJSON was a complete bust for me when I tried it). So far I’ve been able to keep Marten serializer-agnostic and I can easily see times when you’d have to opt for Newtonsoft’s flexibility.

Default jsonb_to_record/LATERAL JOIN

Using this approach, the SQL generated is:

select d.data from mt_doc_target as d, LATERAL jsonb_to_record(d.data) as l("Date" date) where l."Date" = :arg0

Json Locators Only

While you can configure this behavior on a field by field basis, the quickest way is to just set the default document behavior:

public class JsonLocatorOnly : MartenRegistry
{
    public JsonLocatorOnly()
    {
        // This can also be done with attributes
        For<Target>().PropertySearching(PropertySearching.JSON_Locator_Only);
    }
}

With this setting, the generated SQL is:

select d.data from mt_doc_target as d where CAST(d.data ->> 'Date' as date) = :arg0

Searchable, Duplicated Field

Again, to configure this option, I used this code:

public class DateIsSearchable : MartenRegistry
{
    public DateIsSearchable()
    {
        // This can also be done with attributes
        For<Target>().Searchable(x => x.Date);
    }
}

When I do this, the table for the Target type has an additional field called “date” that will get the value of the Target.Date property every time a Target object is inserted or updated in the database.

The resulting SQL is:

select d.data from mt_doc_target as d where d.date = :arg0

The Performance Results

I created the table below by generating randomized data, then trying to search by a DateTime field using three different mechanisms:

var theDate = DateTime.Today.AddDays(3);
var queryable = session.Query<Target>().Where(x => x.Date == theDate);

For each document count, I used the same sample data for all three mechanisms and took the average of running the same query five times, after throwing out an initial attempt where Postgresql seemed to be “warming up” the JSONB data.
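
The measuring approach described here (one discarded warmup run, then an average over five runs) can be sketched roughly as follows. This is not the actual harness from the Marten repository; the Timing class is a hypothetical name, and reporting milliseconds is an assumption of the sketch.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper for illustration only
public static class Timing
{
    // Runs the action once as a discarded warmup, then returns the average
    // elapsed time in milliseconds over the requested number of measured runs
    public static double AverageMilliseconds(Action action, int runs = 5)
    {
        // Warmup run, not included in the average
        action();

        var total = 0.0;
        for (var i = 0; i < runs; i++)
        {
            var stopwatch = Stopwatch.StartNew();
            action();
            stopwatch.Stop();

            total += stopwatch.Elapsed.TotalMilliseconds;
        }

        return total / runs;
    }
}
```

A call against the query above would be something like Timing.AverageMilliseconds(() => queryable.ToArray()), where ToArray() forces the Linq query to execute and the documents to be deserialized inside the timed section.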

Serializer: JsonNetSerializer

Query Type                     | 1K docs | 10K docs | 100K docs | 1M docs
JSON Locator Only              | 9.6     | 75.2     | 691.2     | 9648
jsonb_to_record + lateral join | 10      | 93.6     | 922.6     | 12091.2
searching by duplicated field  | 2.4     | 15       | 169.6     | 2777.8

Serializer: JilSerializer

Query Type                     | 1K docs | 10K docs | 100K docs | 1M docs
JSON Locator Only              | 6.8     | 61       | 594.8     | 7265.6
jsonb_to_record + lateral join | 8.4     | 86.6     | 784.2     | 9655.8
searching by duplicated field  | 1       | 8.8      | 115.4     | 2234.2

To be honest, I expected the jsonb_to_record + LATERAL JOIN mechanism to be faster than the JSON locator only approach. I still need to go back and try adding some indexes, because avoiding the object casts that inevitably defeat indexes is supposed to be the benefit of using jsonb_to_record. I’d be happy to get some Postgresql gurus to weigh in here if there are any reading this.

If you’re curious to see my mechanism for recording this data, see the performance_tuning code file in GitHub.

Bulk Loading Documents

From time to time (testing or data migrations maybe) you’ll have some need to very rapidly load a large set of documents into your database. I added a feature this morning to Marten that exploits Postgresql’s COPY feature supported by Npgsql:

public void load_with_small_batch()
{
    // This is just creating some randomized
    // document data
    var data = Target.GenerateRandomData(100).ToArray();

    // Load all of these into a Marten-ized database
    theSession.BulkLoad(data);

    // And just checking that the data is actually there;)
    theSession.Query<Target>().Count().ShouldBe(data.Length);
    theSession.Load<Target>(data[0].Id).ShouldNotBeNull();
}

Behind the scenes, Marten uses code that is generated at runtime and compiled by Roslyn to do the bulk loading as efficiently as possible without any hit from using reflection:

public void Load(ISerializer serializer, NpgsqlConnection conn, IEnumerable<Target> documents)
{
    using (var writer = conn.BeginBinaryImport("COPY mt_doc_target(id, data) FROM STDIN BINARY"))
    {
        foreach (var x in documents)
        {
            writer.StartRow();
            writer.Write(x.Id, NpgsqlDbType.Uuid);
            writer.Write(serializer.ToJson(x), NpgsqlDbType.Jsonb);
        }
    }
}

Do note that the code generation mechanism is smart enough to also add any fields or properties of the document type that are marked as duplicated for searching.

Other Outstanding Optimization Tasks 

  • Optimize the mechanics for applying all the changes in a unit of work. I’m hoping that we can do something to reduce the number of network round trips between the application and the Postgresql server. My fallback approach is going to be to use a custom PLV8 sproc, but not until we exhaust other possibilities with the Npgsql library.
  • I want some mechanism for queuing up queries and submitting them in one network round trip
  • The ability to make a named, reusable Linq query so you can reuse the underlying ADO.Net command generated from parsing the Linq expression without having to go through all the Expression parsing gymnastics on each usage
  • Really more for scalability than performance, but we’ll get around to asynchronous query methods. I’m just not judging that to be a critical path item right now.
  • It’s probably minor in the grand scheme of things, but the actual Linq expression to Sql query generation is grotesque in how it concatenates strings

Feel very free to make suggestions and other feedback on these items;-)