Optimizing Marten Performance by Using “Include()”

Continuing my series of short blog posts about concepts or new features in Marten, this time I’m going to show how to use the brand new IQueryable<T>().Include() feature to improve performance by reducing the number of network round trips to the underlying Postgresql database. Check out the Marten tag for related blog posts.

In one of the formative experiences of my early software career, our instructor repeated the phrase “network round trips are evil” as a kind of mantra. The point, as I quickly learned then and plenty of times later, is that a “chatty” system making lots of network round trips between boxes can be very slow if you’re not careful.

To that end, many of our early adopters and would-be adopters said that they would switch to Marten if only we had the Include() feature from RavenDb. The point of this feature is to reduce network round trips from your application to the database server by fetching related documents in the same query.

Jumping right to a concrete example, let’s say that your domain has two document types, one called “Issue” and another called “User.” In this case the Issue may have a logical link to the assigned user responsible for addressing it, partially shown below:

    public class Issue
    {
        // The AssigneeId would be the Id of the
        // related User document
        public Guid? AssigneeId { get; set; }

    }

If I want to load an Issue by its Id, but also get the assigned User at the same time, I can use Marten’s new “Include()” feature inspired by RavenDb and NHibernate’s QueryOver mechanism:

[Fact]
public void simple_include_for_a_single_document()
{
    var user = new User();
    var issue = new Issue {AssigneeId = user.Id, Title = "Garage Door is busted"};

    theSession.Store<object>(user, issue);
    theSession.SaveChanges();

    using (var query = theStore.QuerySession())
    {
        User included = null;
        var issue2 = query.Query<Issue>()
            // Using the call below, Marten will execute
            // the supplied callback to pass back the related
            // User document assigned to the Issue
            .Include(x => x.AssigneeId, x => included = x)
            .Where(x => x.Title == issue.Title)
            .Single();

        included.ShouldNotBeNull();
        included.Id.ShouldBe(user.Id);

        issue2.ShouldNotBeNull();

        // All of this was done with exactly one call to Postgresql
        query.RequestCount.ShouldBe(1);
    }
}

The actual SQL statement sent to Postgresql in the code above would be:

select d.data, d.id, assignee_id.data, assignee_id.id from public.mt_doc_issue as d INNER JOIN public.mt_doc_user as assignee_id ON CAST(d.data ->> 'AssigneeId' as uuid) = assignee_id.id where d.data ->> 'Title' = :arg0 LIMIT 1

In the code above, I’m using the fairly new “Include()” statement to direct Marten to fetch the related User document at the same time it’s retrieving the Issue. We deviated somewhat from RavenDb in this feature. Instead of just adding the included documents to the internal identity map and expecting the user to just “know” that they are cached, we opted to make the included documents accessible to the caller through one of:

  1. Passing a callback function into the Include() method
  2. Passing an IList<T>, where T is the included document type, into Include(). In this case, Marten will fill the list with all the included documents found.
  3. Passing an IDictionary<TKey, T> into the Include() method. In this case, Marten will fill the dictionary with the included documents found, keyed by their Ids.
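As a rough sketch of the second and third options, using the same Issue/User documents from above (the exact generic signatures of these Include() overloads have shifted a bit between Marten versions, so treat this as illustrative rather than definitive):

```csharp
using (var query = theStore.QuerySession())
{
    // Option 2: collect every distinct included User into a list
    var users = new List<User>();
    var issues = query.Query<Issue>()
        .Include<User>(x => x.AssigneeId, users)
        .ToList();

    // Option 3: collect the included Users keyed by their Id
    var usersById = new Dictionary<Guid, User>();
    var issues2 = query.Query<Issue>()
        .Include(x => x.AssigneeId, usersById)
        .ToList();
}
```

Either way, Marten still makes a single round trip to Postgresql and fills the supplied collection as it reads through the joined rows.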


Since the pace of development on Marten is temporarily outpacing my efforts at keeping the documentation website completely up to date, the best resource for seeing what’s possible with the Include() functionality is our acceptance tests.

Let me end with a couple of salient points about the new Include() functionality:

  • The included documents are resolved through the internal identity map of the current session, so there will not be any duplicates from repeated documents. Think about the case of fetching 100 Issues that are all assigned to one of 5 different Users. In this case, only the 5 recurring User documents would be returned.
  • You can do multiple Include() calls on one query
  • The Include() functionality is available in the batched query feature
  • This will be a topic for another post, but Marten already supports the creation of foreign key relationships between documents
  • By default, Marten uses an outer join to fetch the included documents. I didn’t show it above, but there is also an optional argument in the Include() method you can use to force Marten to use a more efficient inner join — but just remember that means that nothing will be returned in the case of a NULL Issue.AssigneeId in the example above.
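For that last point, the inner join option is just an extra argument to Include(). It looks something like the following sketch, where the JoinType enum and the exact overload signature are assumptions based on the Marten version discussed here:

```csharp
User included = null;

var issue = query.Query<Issue>()
    // Forcing an inner join means that any Issue with a
    // null AssigneeId is filtered out of the results
    .Include(x => x.AssigneeId, x => included = x, JoinType.Inner)
    .Where(x => x.Title == "Garage Door is busted")
    .Single();
```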


For my next blog post, I’ll talk about our brand new as-of-this-morning “Compiled Query” feature.

Select Projections in Marten

When I did the DotNetRocks episode on Marten a while back, they asked me what I thought the biggest holes in Marten’s functionality were and where we would take it next. The first thing that came to my mind was the “read side.” By that I meant built-in functionality to transform the raw documents stored with Marten into the shape needed for API’s, business functionality, and web pages. Fortunately, the very latest versions of Marten add some important functionality to support the Linq Select() keyword for fetching document data with transformations.

See CQRS from Martin Fowler for more context on where and how the “read side” fits into a software architecture.

To make this concrete, let’s say that you only really want one property or field of a stored document. That one field can now be fetched without incurring the cost of deserializing the raw JSON of the whole document. Instead, we’ll just fetch our one field:

        [Fact]
        public void use_select_in_query_for_one_field_and_first()
        {
            theSession.Store(new User { FirstName = "Hank" });
            theSession.Store(new User { FirstName = "Bill" });
            theSession.Store(new User { FirstName = "Sam" });
            theSession.Store(new User { FirstName = "Tom" });

            theSession.SaveChanges();

            theSession.Query<User>().OrderBy(x => x.FirstName).Select(x => x.FirstName)
                .First().ShouldBe("Bill");

        }

Maybe not *that* commonly useful, so what about when you want just a subset of a large document? To select into a smaller type, you can use code like this:

[Fact]
public void use_select_with_multiple_fields_to_other_type()
{
    theSession.Store(new User { FirstName = "Hank", LastName = "Aaron" });
    theSession.Store(new User { FirstName = "Bill", LastName = "Laimbeer" });
    theSession.Store(new User { FirstName = "Sam", LastName = "Mitchell" });
    theSession.Store(new User { FirstName = "Tom", LastName = "Chambers" });

    theSession.SaveChanges();

    var users = theSession.Query<User>().Select(x => new User2{ First = x.FirstName, Last = x.LastName }).ToList();

    users.Count.ShouldBe(4);

    users.Each(x =>
    {
        x.First.ShouldNotBeNull();
        x.Last.ShouldNotBeNull();
    });
}

In the case above, you are selecting the User document data into another type called “User2.” That’s great of course, but you can also bypass the need for a custom class and just select straight to an anonymous type like this:

[Fact]
public void use_select_with_multiple_fields_in_anonymous()
{
    theSession.Store(new User { FirstName = "Hank", LastName = "Aaron"});
    theSession.Store(new User { FirstName = "Bill", LastName = "Laimbeer"});
    theSession.Store(new User { FirstName = "Sam", LastName = "Mitchell"});
    theSession.Store(new User { FirstName = "Tom", LastName = "Chambers"});

    theSession.SaveChanges();

    var users = theSession.Query<User>().Select(x => new {First = x.FirstName, Last = x.LastName}).ToList();

    users.Count.ShouldBe(4);

    users.Each(x =>
    {
        x.First.ShouldNotBeNull();
        x.Last.ShouldNotBeNull();
    });
}

Finally, if all you need to do is stream the raw JSON data of your transformed documents straight to the HTTP response in a web request, you can skip the unnecessary JSON deserialization and serialization and probably achieve much better throughput in your web application by selecting straight to a JSON string:

[Fact]
public void use_select_to_another_type_and_to_json()
{
    theSession.Store(new User { FirstName = "Hank" });
    theSession.Store(new User { FirstName = "Bill" });
    theSession.Store(new User { FirstName = "Sam" });
    theSession.Store(new User { FirstName = "Tom" });

    theSession.SaveChanges();

    theSession.Query<User>().OrderBy(x => x.FirstName).Select(x => new UserName { Name = x.FirstName })
        .ToListJson()
        .ShouldBe("[{\"Name\" : \"Bill\"},{\"Name\" : \"Hank\"},{\"Name\" : \"Sam\"},{\"Name\" : \"Tom\"}]");
}

In the code above, the JSON of the User document is transformed by Postgresql itself and returned to the caller as a single JSON string.

Not that you’d necessarily want to do a lot of this by hand (though you always could), but the generated SQL can be captured and printed with this:

    var command = theSession
                .Query<User>()
                .OrderBy(x => x.FirstName)
                .Select(x => new UserName {Name = x.FirstName})
                
                // ToCommand() is a Marten specific extension
                // we use as a diagnostic tool to understand
                // how Marten is treating any given Linq expression
                .ToCommand(FetchType.FetchMany);

    Console.WriteLine(command.CommandText);

That gives us this generated SQL:

select json_build_object('Name', d.data ->> 'FirstName') as json from public.mt_doc_user as d order by d.data ->> 'FirstName'

What’s left to do?

We’ve had a couple bugs with our Select() support from early adopters with permutations I hadn’t thought to test beforehand, so I’m pretty sure that there are more things to iron out. The best way to solve that problem is to just try to get more early users to beat on it and find what’s still missing.

The Select() support today doesn’t yet include transformations of child collection data or transformations of data within a nested JSON document. We’d probably also want to support selecting data straight out of child collections.

The big new thing in our “read side” repertoire is going to be calculated projections. You can follow that work on GitHub.

The Identity Map Pattern in Marten

I’m still a believer in learning and discussing design patterns, even though everyone has seen a naive architect of some kind write stupidly over-engineered code stuffed with every possible GoF buzzword. That being said, there’s significant value in an industry-wide understanding of common coding solutions, and the design pattern names make it much easier to find information about prior art online. As an aside, I hate it when developers online argue against a particular tool or technique because “one time they were on a project where it sucked,” without any thought about how it was used or whether the problem was simply a bad fit for that tool or technique. I think that’s sloppy thinking.

The Identity Map pattern is an important conceptual design pattern underneath many database and persistence libraries, including Marten. I think it’s important to understand because the usage of an identity map can help performance in some cases, hurt your system’s memory utilization in other cases, and quite potentially prevent data integrity and consistency issues.

As usual, I’ll pull the definition of the Identity Map pattern from Martin Fowler’s PEAA book:

Ensures that each object gets loaded only once by keeping every loaded object in a map. Looks up objects using the map when referring to them.

The purpose of using an identity map is to avoid accidentally making multiple copies of a loaded entity in memory. For an operation complex enough to be handled by multiple collaborating classes or functions, it is frequently valuable to share an identity map between the collaborators to prevent fetching the exact same data from the database more than once.

To see this in action, consider the following usage in Marten. Let’s say that we have a document called “User” that is identified by a surrogate Guid property. If I open up a new document session with the default configuration, you would see this behavior:

public void using_identity_map()
{
    var container = Container.For<DevelopmentModeRegistry>();
    var store = container.GetInstance<IDocumentStore>();

    var user = new User {FirstName = "Tamba", LastName = "Hali"};
    store.BulkInsert(new [] {user});

    // Open a document session with the identity map
    using (var session = store.OpenSession())
    {
        // Loading a user with the same Id will return the very same object
        session.Load<User>(user.Id)
            .ShouldBeTheSameAs(session.Load<User>(user.Id));

        // And to make this more clear, Marten is only making a single
        // database call
        session.RequestCount.ShouldBe(1);
    }
} 

In our applications at work, the IDocumentSession (“session” above) that wraps the internal identity map (and unit of work too) would usually be scoped to a web request in HTTP applications and to a single message in our service bus applications. We do this so that different pieces of middleware code and message handlers are all using the same identity map, avoiding double loading or inconsistent state.


Automatic Dirty Checking

A heavier weight flavor of identity map is one that does automatic “dirty checking” to know what documents loaded through the IDocumentSession have been changed in memory and should therefore be persisted when the session is saved.

[Fact]
public void when_querying_and_modifying_multiple_documents_should_track_and_persist()
{
    var user1 = new User { FirstName = "James", LastName = "Worthy 1" };
    var user2 = new User { FirstName = "James", LastName = "Worthy 2" };
    var user3 = new User { FirstName = "James", LastName = "Worthy 3" };

    theSession.Store(user1);
    theSession.Store(user2);
    theSession.Store(user3);

    theSession.SaveChanges();

    using (var session2 = CreateSession())
    {
        var users = session2.Query<User>().Where(x => x.FirstName == "James").ToList();

        // Mutating each user
        foreach (var user in users)
        {
            user.LastName += " - updated";
        }

        // Persisting the session will save all the documents
        // that have changed
        session2.SaveChanges();
    }

    using (var session2 = CreateSession())
    {
        var users = session2.Query<User>()
            .Where(x => x.FirstName == "James")
            .OrderBy(x => x.LastName).ToList();

        // Just proving out that every User was persisted
        users.Select(x => x.LastName)
            .ShouldHaveTheSameElementsAs("Worthy 1 - updated", "Worthy 2 - updated", "Worthy 3 - updated");
    }
}

In the usage above, I never had to explicitly mark which User objects had been changed. In this type of session, Marten is tracking the raw JSON used to load each document. When SaveChanges() is called, Marten does a logical comparison of the current document state against the original, loaded state by comparing the JSON structures (it’s inevitably using Newtonsoft.Json under the covers to do the comparison of the JSON data, but you already guessed that).

Some of our users really like the convenience of the automatic dirty checking, but other times you’ll definitely want to forgo the heavier, more memory and processor intensive version of the identity map in favor of lightweight sessions as shown in the next section.

A favorite ritual of my childhood was rooting hard for the Showtime Lakers every summer in the NBA finals while my Dad was all about the Larry Bird/Kevin McHale/Robert Parish Celtics. Somewhere or another there’s a good chunk of the roster of the ’85 Lakers as test data in most of the projects I work on.

Opting out of the Identity Map

Veteran users of RavenDb are probably painfully aware of how fetching a large amount of data can quickly blow up your system’s memory usage, since (in the default configuration) it keeps so much of the raw JSON and pointers to the loaded objects in memory. Because of this all-too-frequent problem with RavenDb usage, we designed Marten to make it as easy and declarative as possible to use lightweight sessions or pure query sessions that have no identity map or automatic dirty tracking, like so:

            // Opened from an existing IDocumentStore called "store"
            using (var session = store.LightweightSession())
            {

            }

            // A lightweight, readonly session 
            using (var query = store.QuerySession())
            {

            }

Likewise, we made the very heavyweight, automatic dirty tracking flavor of a document session “opt in,” in the belief that this keeps the heavier option from shooting unsuspecting users in the foot.

Marten does not yet support any kind of notion of “Evict()” to remove previously loaded documents from the underlying identity map. To date, my philosophy is to give the users easier access to the lightweight sessions to side step the whole issue of needing to evict documents manually.

What about Queries?

You might notice that all of my examples of the identity map behavior used the IDocumentSession.Load<T>(id) method to load a single document by its id. In this usage, a Marten document session first checks its internal identity map to see if that document has already been loaded. If not, the session will load the document and save it to the underlying identity map.

Great, but you’re likely asking “what about Linq queries?” We introduced the identity map mechanics fairly early in Marten, and all queries run through the identity map caching. A Linq query works by returning a data reader of each document’s Id and its raw, persisted JSON. As Marten reads through the results of a data reader, for each row it will call the following method in Marten’s internal IIdentityMap interface:

// This method would either return an existing document
// with the id, or deserialize the JSON into a new
// document object and store that in the identity map
T Get<T>(object id, string json) where T : class;

Using a Linq query does honor the identity map tracking. It can still fetch the raw JSON data multiple times, but it prevents duplicated document objects and unnecessary deserialization at runtime.
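To make that concrete, here’s a small sketch (assuming a User document with the given Id has already been stored):

```csharp
using (var session = store.OpenSession())
{
    var loaded = session.Load<User>(userId);

    // The Linq query below still pulls the matching row's raw JSON
    // from Postgresql, but the row resolves through the identity map
    // to the already-loaded object instead of a fresh deserialization
    var queried = session.Query<User>().Single(x => x.Id == userId);

    ReferenceEquals(loaded, queried).ShouldBeTrue();
}
```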


Natural versus Surrogate Keys

Using document databases over the past 5 or so years has changed my old attitudes toward the choice of natural database keys (some piece of data that has actual meaning, like names) versus surrogate keys like Guids or sequential numbers. 5-10 years ago, in the days of heavyweight ORM’s like NHibernate (or today if you’re a mainstream .Net developer using tools like EF), I would have been adamantly in favor of only using surrogate keys. One, because a natural key can change, and it can be clumsy to modify the primary key of a relational database table. Two, because using a surrogate key meant that you could adopt some kind of layer supertype for all of your entities that would allow you to centralize and reuse a lot of your application’s workflow for typical CRUD operations.

Today however, I think there are such valuable performance advantages to being able to efficiently load documents by their natural identifier through an identity map that this choice is no longer so clear cut. Take the example of a document representing a “User.” At login time, or even after authentication, you most likely have the user name, but not necessarily any kind of Guid representing that user. If we modeled the “User” document with the login name as a natural key, we can efficiently load user documents by that user name.
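As a sketch of that idea, relying on Marten’s convention of treating a public Id member as the document identity (the user name value here is purely illustrative, and this is a variant of the User document from earlier):

```csharp
public class User
{
    // Using the natural key (the login name) as the document
    // identity instead of a surrogate Guid
    public string Id { get; set; }

    public string FirstName { get; set; }
    public string LastName { get; set; }
}

// At login time we already know the user name, so we can go
// straight through the identity map / primary key lookup
// instead of querying on a non-key field:
using (var session = store.OpenSession())
{
    var user = session.Load<User>("some.user.name");
}
```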

The example above isn’t the slightest bit contrived. Rather, it’s exactly the mistake I made when I designed the RavenDb-backed persisted membership feature of FubuMVC that is still in a couple of our systems at work. In our case, we have to load a user by querying on the user name we have instead of the Guid surrogate key that we don’t know upfront. That’s not that big of a deal with Postgresql-backed Marten, but it became a significant problem for us with RavenDb because it forces RavenDb to load the document through a read-side index, which is a less efficient mechanism than loading by id. In this case, we could have had a more efficient login identity solution if I’d broken away from the “old think” belief in the primacy of surrogate keys in all situations. Lesson learned.

The Unit of Work Pattern in Marten

Design patterns got a bad rap when they were horrendously overused in the decade-plus following the publication of the Gang of Four book. I’m still a believer in learning design patterns. I even think it’s valuable to know and understand the “official” names for patterns. One, because it’s really nice for other developers to understand the jargon that you’re using to describe possible solutions, but mostly so that you can easily go Google for a lot more information about how, when, and when not to use a pattern later as you need to know more.

That being said, I think it’s useful for new users to Marten to understand the old “Unit of Work” design pattern, both conceptually and how Marten implements the pattern.

Jeremy’s Law of Nerd-Rage: a group of developers being enthusiastic about anything related to development will eventually make a second group of developers angry and cause a backlash.


Unit of Work

From Martin Fowler’s PEAA book:

Maintains a list of objects affected by a business transaction and coordinates the writing out of changes and the resolution of concurrency problems.

The unit of work pattern is a simple way to collect all the changes to a backing data store you need to be committed in a single transaction. It’s especially useful if you need to have several unrelated pieces of code collaborating on the same logical transaction. If you just pass around a “unit of work” container for the changes to each “worker” object or function, you can keep the various things completely decoupled from each other and still collaborate on a single business transaction.

In Marten, the unit of work is buried behind our IDocumentSession interface (that might change later). When you register changes to an IDocumentSession (inserts, updates, deletes, appending events, etc.), these changes are only staged until the SaveChanges() method is called to persist all the pending changes in one database transaction.

You can see Marten’s unit of work mechanics in action in the code sample below:

// theStore is a DocumentStore
using (var session = theStore.OpenSession())
{
    // All I'm doing here is recording references
    // to all the ADO.Net commands executed by
    // this session
    var logger = new RecordingLogger();
    session.Logger = logger;

    // Insert some new documents
    session.Store(new User {UserName = "luke", FirstName = "Luke", LastName = "Skywalker"});
    session.Store(new User {UserName = "leia", FirstName = "Leia", LastName = "Organa"});
    session.Store(new User {UserName = "wedge", FirstName = "Wedge", LastName = "Antilles"});


    // Delete all users matching a certain criteria
    session.DeleteWhere<User>(x => x.UserName == "hansolo");

    // deleting a single document by Id, if you had one
    session.Delete<User>(Guid.NewGuid());

    // Persist in a single transaction
    session.SaveChanges();

    // All of this was done in one batched command
    // in the same transaction
    logger.Commands.Count.ShouldBe(1);

    // I'm just writing out the Sql executed here
    var sql = logger.Commands.Single().CommandText;
    new FileSystem()
        .WriteStringToFile("unitofwork.sql", sql);

}

All of the database changes above are made in a single database call within one transaction (I added extra new lines to make it readable):

select mt_upsert_user(doc := :p0, docId := :p1);
select mt_upsert_user(doc := :p2, docId := :p3);
select mt_upsert_user(doc := :p4, docId := :p5);
delete from mt_doc_user as d where d.data ->> 'UserName' = :arg6;
delete from mt_doc_user where id=:p6;

As for how to consume IDocumentSession’s and how to scope them for transactional boundary management: typically, I would recommend using an IDocumentSession per HTTP request or within the handling of a single service bus message, and we build our basic infrastructure around this concept. You still need an easy way to use explicit transaction boundaries with a new IDocumentSession on demand. I wrote about our transaction management strategies with RavenDb a couple of years ago, and my intention is that we’ll use Marten in a very similar manner at work.

Marten on DotNetRocks

Remarkably coincidental with the new Marten v0.8 release, the DotNetRocks guys just published a podcast talking with me about Marten. Give it a shot if you wanna learn a little bit about the motivation behind Marten, using Postgresql for .Net development, and using “polyglot persistence.”

I think the DNR guys were trying to get me to trash talk a little bit about Marten versus its obvious competitor, so how about just saying that I feel like Marten will lead to more developer productivity than using Entity Framework or micro-ORM’s and better performance and reliability than existing NoSQL solutions for .Net.

Marten Gets Serious with v0.8

EDIT: DotNetRocks just published a new episode about Marten.

I just pushed a new v0.8 version of Marten to Nuget.org with a handful of new features, a lot of bugfixes, and a slew of refinements in response to feedback from our early adopters. I feel like this release added some polish to Marten, and I largely attribute that to how helpful our community has been in making suggestions and contributing pull requests based on their early usage.

The documentation site has been updated to reflect the new changes and additions to Marten for v0.8. The full list of changes and bug fixes for v0.8 is available in the GitHub issue history for this milestone.

Release Highlights

  1. Marten got a lot more sophisticated about how it updates schema objects in the underlying Postgresql database. See the changes to AutoCreateSchemaObjects and the ability to update a table for additional “searchable” columns instead of dropping and recreating tables in the documentation.
  2. “Bulk deletes” by a query, without first having to load the documents being deleted
  3. The Linq parsing support got worked over pretty hard, including some new Linq query extensibility.
  4. New mechanisms for instrumentation and diagnostics within Marten. I’ll have a blog post next week about how we’re going to integrate with this at work for profiling database activity in our web and service bus apps.
  5. A lot of work to integrate the forthcoming event store functionality. I left this out of the documentation updates for now, but there’ll be a lot more about this later.


What’s Up Next?

There’s some preliminary planning for Marten v0.9 on GitHub. The very next feature I’m tackling is going to be our equivalent to RavenDb’s Include() feature (but I think we’re going to do it differently). Other than that, I think the theme of the next release is going to be addressing the “read side” of Marten applications with view projections by finally supporting “Select()” in the Linq support and some mix of Javascript or .Net projections.

A note on the versioning

While I am a big believer in semantic versioning (or at least I think it’s far better than having nothing), Marten is still pre-1.0 and a few public API’s did change. We haven’t discussed a timeline for the all-important “1.0” release, but I’m hopeful that that can happen by this summer (2016).

An Example of the Open/Closed Principle in Action

I saw someone on Twitter this month say that they’ve never really understood the Open/Closed Principle (OCP, the “O” in S.O.L.I.D.). I think it’s a very important concept in software architecture, but the terse statement maybe doesn’t make it too clear what it’s really about:

software entities (classes, modules, functions, etc.) should be open for extension, but closed for modification

There are some other ways to interpret the Open/Closed Principle (the Wikipedia article about it talks about inheritance, which I think is short-sighted), but my restatement of the OCP would be:

“structure code such that additional functionality can be mostly written in all new code modules with no, or at least minimal, changes to existing code modules”

The key point is that it’s much less risky and usually easier to write brand new code — especially if the new code has minimal coupling to old code — than it is to modify existing code. Or to put it another way, can I continually add all new functionality to my existing system without causing a lot of regression bugs?

It’s not just risk either, it’s generally easier to understand complicated new code written on a blank canvas than it is to open up an existing code file and find the right places to insert your changes without breaking the old functionality.


Some examples of systems from my career that most definitely did not follow OCP might better illustrate why you’d care about OCP:

  1. A dynamic web page application that was effectively written in one single VB6 class. Every single addition or fix to the application meant editing that one file, and very frequently broke existing functionality
  2. A large shipping application where every bit of routing logic for box positions within a factory floor was coded in a single, giant switch statement that shared a lot of global state. Again, changes to routing logic commonly broke existing functionality, and the cost of regression testing this routing logic slowed down the team in charge of this system considerably.
  3. COBOL style batch processes coded in giant stored procedures with lots of global state
  4. Naive usages of Redux in Javascript could easily lead to the massive switch statement problem where all kinds of unrelated code changes involve the same central file

An OCP Example: Linq Provider Extensibility in Marten

We’ve been building (and building and building) Linq query support into Marten. Linq support is the type of problem I refer to as “Permutation Hell,” meaning that there’s an almost infinite supply of “what about querying by this type/operator/method call?” use cases. Recently, one of our early adopters asked for Linq support for querying by a range of values like this:

// Find all SuperUser documents where the role is "Admin", 
// "Supervisor", or "Director"
var users = theSession.Query<SuperUser>()
    .Where(x => x.Role.IsOneOf("Admin", "Supervisor", "Director"));

In the case above, IsOneOf() is a custom extension method in Marten that just means “the value of this property/field should be any of these values”.

I thought that was a great idea, but at the time the Linq provider code in Marten was effectively a “Spike-quality” blob of if/then branching logic. Extending the Linq support meant tracing through the largely procedural code to find the right spot to insert the new parsing logic. I think recognizing this, our early adopter also suggested making an extensibility point so that users and contributors could easily author and add new method parsing to the Linq provider.

What we really needed was a little bit of Open/Closed structuring so that additional method call parsing for things like IsOneOf() could be written in brand new code instead of trying to ram more branching logic into the older MartenExpressionParser class (the link is to an older version).

Looking through the old Linq parsing code, I realized there was an opportunity to abstract the responsibility for handling a call to a method in Linq queries behind this interface from Marten:

    /// <summary>
    /// Models the Sql generation for a method call
    /// in a Linq query. For example, map an expression like Where(x => x.Property.StartsWith("prefix"))
    /// to part of a Sql WHERE clause
    /// </summary>
    public interface IMethodCallParser
    {
        /// <summary>
        /// Can this parser create a Sql where clause
        /// from part of a Linq expression that calls
        /// a method
        /// </summary>
        /// <param name="expression"></param>
        /// <returns></returns>
        bool Matches(MethodCallExpression expression);

        /// <summary>
        /// Creates an IWhereFragment object that Marten
        /// uses to help construct the underlying Sql
        /// command
        /// </summary>
        /// <param name="mapping"></param>
        /// <param name="serializer"></param>
        /// <param name="expression"></param>
        /// <returns></returns>
        IWhereFragment Parse(
            IDocumentMapping mapping, 
            ISerializer serializer, 
            MethodCallExpression expression
            );
    }

The next step was to pull out strategy classes implementing this interface for the methods we already supported, like String.Contains(), String.StartsWith(), or String.EndsWith(). Inside of the Linq provider support, the next step was to select the right strategy for a method expression and use that to help create the Sql string:

protected override Expression VisitMethodCall(MethodCallExpression expression)
{
    var parser = _parent._options.Linq.MethodCallParsers.FirstOrDefault(x => x.Matches(expression)) 
        ?? _parsers.FirstOrDefault(x => x.Matches(expression));

    if (parser != null)
    {
        var @where = parser.Parse(_mapping, _parent._serializer, expression);
        _register.Peek()(@where);

        return null;
    }


    throw new NotSupportedException($"Marten does not (yet) support Linq queries using the {expression.Method.DeclaringType.FullName}.{expression.Method.Name}() method");
}

Once that was in place, I could build out the IsOneOf() search functionality by building an all new class implementing that IMethodCallParser interface described above. To wire up the new strategy, it was a one line change to the existing Linq code:

        // The out of the box method call parsers
        private static readonly IList<IMethodCallParser> _parsers = new List<IMethodCallParser>
        {
            new StringContains(),
            new EnumerableContains(),
            new StringEndsWith(),
            new StringStartsWith(),

            // Added
            new IsOneOf()
        };

So yes, I did have to “open” up the existing code to make a small change to enable the new functionality, but at least it was a low impact change with minimal risk.
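To make the shape of such a strategy class concrete, here is a simplified, hypothetical sketch of what the IsOneOf parser could look like. The WhereFragment type and the JsonLocator()/Value() helpers below are stand-ins I'm assuming for illustration, not necessarily the real Marten internals:

```csharp
// A simplified, hypothetical sketch of an IMethodCallParser for IsOneOf().
// WhereFragment, JsonLocator(), and Value() are illustrative stand-ins,
// not necessarily Marten's real helper types.
public class IsOneOf : IMethodCallParser
{
    public bool Matches(MethodCallExpression expression)
    {
        // Only claim calls to the IsOneOf() extension method
        return expression.Method.Name == "IsOneOf";
    }

    public IWhereFragment Parse(
        IDocumentMapping mapping,
        ISerializer serializer,
        MethodCallExpression expression)
    {
        // Locate the JSONB field being queried...
        var locator = mapping.JsonLocator(expression.Arguments[0]);

        // ...and pull out the candidate values passed to IsOneOf()
        var values = expression.Arguments[1].Value();

        // "= ANY(?)" lets Postgresql match the field against an array parameter
        return new WhereFragment($"{locator} = ANY(?)", values);
    }
}
```

The important design point is that all the knowledge about IsOneOf() lives in this one new class; the existing parsing code never has to know the method exists.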

I didn’t show it in this post, but there is also a new way to add your own implementations of IMethodCallParser to a Marten document store. I’m not entirely sure how many folks will take advantage of that extensibility point, but the structural refactoring I did to enable this story should make it much easier for us to continue to refine our Linq support.

My example is yet another case of using plugin strategies to demonstrate the Open/Closed Principle, but I think the real emphasis should be on compositional designs. Even without formal plugin patterns, IoC containers, or configuration strategies, using the OCP to guide your thinking about how to minimize the risk of later changes is still valuable.

StructureMap 4.1 is Out

I just pushed the NuGet for StructureMap 4.1 to nuget.org and updated the documentation website for the changes. It’s not a huge release, but there were some bug fixes and one new feature that folks were asking for. StructureMap tries hard to follow Semantic Versioning guidelines, and the minor point version just denotes that there are new public APIs, but no existing APIs from 4.0.* were changed.

Thank you to everyone who contributed pull requests and to the users who patiently worked with me to understand what was going wrong in their StructureMap usage.

What’s different?

For the entire list, see the GitHub issue list for 4.1. The highlights are:

  1. The assembly discovery mechanism in type scanning has new methods to scan for “.exe” files as Assembly candidates. I had removed “.exe” file searching in 4.0, thinking that it was more problematic than helpful; then several users asked for it back.
  2. The assembly discovery handles the PrivateBinPath of the AppDomain in all (known) cases now
  3. Child container creation is thread safe now
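As a quick illustration of why the third item matters, code like this sketch should now be safe to run (IWidget and DefaultWidget are invented placeholder types for this example):

```csharp
// Hypothetical sketch: creating child containers from many threads at once,
// which is safe as of StructureMap 4.1. IWidget and DefaultWidget are
// invented placeholder types for illustration.
var container = new Container(_ =>
{
    _.For<IWidget>().Use<DefaultWidget>();
});

Parallel.For(0, 100, i =>
{
    // Each child container shares the parent's configuration but can
    // carry its own overrides and be disposed independently
    using (var child = container.CreateChildContainer())
    {
        var widget = child.GetInstance<IWidget>();
    }
});
```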

 

Batch Queries with Marten

Marten v0.7 was published just under two weeks ago, and one of the shiny new features was the batched query model, with what I'd call a trial balloon syntax that was shot down pretty fast in the Marten Gitter room (I wasn’t happy with it either). To remedy that, we pushed a new NuGet this morning (v0.7.1) with a new, streamlined syntax for batched queries, and updated the batched query docs to match.

So here’s the problem it tries to solve: say you have an HTTP endpoint that needs to aggregate several different sources of document data into a single JSON message back to your web client (this is a common scenario in a large application at my work that is going to be converted to Marten shortly). To speed up that JSON endpoint, you’d like to batch those queries into a single call to the underlying Postgresql database while still having an easy way to get at the results of each query later. This is where Marten’s batch query functionality comes in, as demonstrated below:

// Start a new IBatchQuery from an active session
var batch = theSession.CreateBatchQuery();

// Fetch a single document by its Id
var user1 = batch.Load<User>("username");

// Fetch multiple documents by their id's
var admins = batch.LoadMany<User>().ById("user2", "user3");

// User-supplied sql
var toms = batch.Query<User>("where first_name = ?", "Tom");

// Query with Linq
var jills = batch.Query<User>().Where(x => x.FirstName == "Jill").ToList();

// Any() queries
var anyBills = batch.Query<User>().Any(x => x.FirstName == "Bill");

// Count() queries
var countJims = batch.Query<User>().Count(x => x.FirstName == "Jim");

// The Batch querying supports First/FirstOrDefault/Single/SingleOrDefault() selectors:
var firstInternal = batch.Query<User>().OrderBy(x => x.LastName).First(x => x.Internal);

// Kick off the batch query
await batch.Execute();

// All of the query mechanisms of the BatchQuery return
// Task's that are completed by the Execute() method above
var internalUser = await firstInternal;
Debug.WriteLine($"The first internal user is {internalUser.FirstName} {internalUser.LastName}");

Using the batch query is a four-step process:

  1. Start a new batch query by calling IDocumentSession.CreateBatchQuery()
  2. Define the queries you want to execute by calling the Query() methods on the batch query object. Each query operator returns a Task<T> object that you’ll use later to access the results after the query has completed (under the covers it’s just a TaskCompletionSource).
  3. Execute the entire batch of queries and await the results
  4. Access the results of each query in the batch, either by using the await keyword or Task.Result.
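Putting those four steps together, an aggregating HTTP endpoint might look roughly like this sketch (the Dashboard DTO and the specific queries are invented for illustration):

```csharp
// Hypothetical sketch of an aggregating endpoint built on the batch query.
// The Dashboard DTO and the queries themselves are invented for illustration.
public async Task<Dashboard> LoadDashboard(IDocumentSession session, Guid userId)
{
    // 1. Start a new batch query from the active session
    var batch = session.CreateBatchQuery();

    // 2. Define the queries; each one hands back an uncompleted Task
    var user = batch.Load<User>(userId);
    var openIssues = batch.Query<Issue>()
        .Where(x => x.AssigneeId == userId)
        .ToList();

    // 3. Execute everything as a single round trip to Postgresql
    await batch.Execute();

    // 4. The tasks are completed now, so await just unwraps the results
    return new Dashboard
    {
        User = await user,
        OpenIssues = await openIssues
    };
}
```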

 

A Note on our Syntax vis-à-vis RavenDb

You might note that the Marten syntax is quite a bit different, both syntactically and even conceptually, from RavenDb’s Lazy Query feature. While we originally started Marten with the idea that we’d stay very close to RavenDb’s API to ease the migration effort, we’re starting to deviate as we see fit. In this particular case, I wanted the API to be more explicit about the contents and lifecycle of the batched query. In other cases, like the forthcoming “Include Query” feature, we will probably stay very close to RavenDb’s syntax unless we have better ideas or a strong reason to deviate from the existing art.

 

A Note on “Living” Documentation

I’ve received a lot of criticism over the years for having inadequate, missing, or misleading documentation for the OSS projects I’ve run. Starting with Storyteller 3.0 and StructureMap 4.0 last year, and now Marten this year, I’ve been having some success using Storyteller’s static website generation to author technical documentation in a way that makes it easy to keep code samples and content up to date with changes to the underlying tool. In the case of the batched query syntax from Marten above, the code samples are pulled directly from the acceptance tests for the feature. As soon as I made the changes to the code, I was able to update the documentation online to reflect the new syntax by running a quick script and pushing to the gh-pages branch of the Marten repository. All told, it took me under a minute to refresh the content online.

Storyteller, Continuous Integration, and the Art of Failing Fast

Someone today asked me if it was possible to use Storyteller 3 as part of their continuous integration builds. Fortunately, I was able to answer “yes” and point them to documentation on running Storyteller specifications with the headless “st run” command. One of my primary goals for Storyteller 3 was to make our existing continuous integration suites faster and more reliable, and, for heaven’s sake, to fail fast instead of trying to execute specifications against a hopelessly broken environment. You might not have any intention of ever touching Storyteller itself, but the lessons we learned from earlier versions of Storyteller and the resulting improvements in 3.0 that I’m describing in this post should be useful with any kind of test automation tooling.

How Storyteller 3 Integrates with Continuous Integration

While you generally author and even execute Storyteller specifications at development time with the interactive editor web application, you can also run batches of specifications with the “st run” command from the command line. By exposing this command line interface, you should be able to incorporate Storyteller into any kind of CI server or build automation tooling.

The results are written to a single, self-contained HTML file that can be opened and browsed directly (the equivalent report in earlier versions was a mess).

 

Acceptance vs. Regression Specs

This has been a bit of a will-o’-the-wisp, always just out of reach for most of my career, but ideally you’d like the Storyteller specifications to be expressed – if not completely implemented – before developers start work on any new feature or user story. If you really can pull off “acceptance test driven development,” that means you may very well be trying to execute Storyteller specifications in CI builds for features that aren’t really done yet. That’s okay though, because Storyteller lets you mark specifications with two different “lifecycle” states:

  1. Acceptance – The default state, just tells Storyteller that it’s a work in progress
  2. Regression – The functionality expressed in a specification is supposed to be working correctly

For CI builds, you can either choose to run acceptance specifications strictly for informational value or leave them out for the sake of build times. Either way, the acceptance specs do not count toward the “st run” tool passing or failing the build. Any failures while running regression specifications will always fail the build though.

 

Being Judicious with Retries

To deal with “flaky” tests that had a lot of timing issues due to copious amounts of asynchronous behavior, the original Storyteller team took some inspiration from jQuery and added the ability to make Storyteller retry failing specifications a certain number of times and accept any later successes.

You really shouldn’t need this feature, but it’s an imperfect world and you might very well need it anyway. What we found in earlier versions of Storyteller, though, was that the retries were too generous and made CI build times run far too long when things went off the rails.

In Storyteller 3, we adopted some more stringent guidelines for when and when not to retry specifications:

  1. The new default behavior is to not allow retries. You now have to opt into retries either on a specification by specification basis (recommended) or supply a default maximum retry count as a command line argument.
  2. Acceptance specifications are never retried
  3. Specifications will never be retried if an execution detects “critical” or “catastrophic” errors. This was done to try to distinguish between “timing errors” and cases where the system just flat out fails. The classic example we used when designing this behavior was getting an exception when trying to navigate to a new Url in a browser application.
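Expressed as code, the retry decision works roughly like the following sketch (my paraphrase of the guidelines above, not the actual Storyteller source; the SpecResults, Specification, and Lifecycle types are invented here):

```csharp
// A conceptual paraphrase of Storyteller 3's retry rules -- not the real
// implementation. SpecResults, Specification, and Lifecycle are invented.
public static bool ShouldRetry(SpecResults results, Specification spec, int attempts)
{
    // Critical or catastrophic errors mean the system is broken rather
    // than flaky, so a retry could never succeed
    if (results.HadCriticalException || results.HadCatastrophicException)
        return false;

    // Acceptance specs are works in progress and are never retried
    if (spec.Lifecycle == Lifecycle.Acceptance)
        return false;

    // Otherwise, retry only up to the spec's opt-in maximum, which defaults to zero
    return attempts < spec.MaxRetries;
}
```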

 

Failing Faster This Time

Prior to Storyteller 3, our CI builds could go on forever when the system or environment was non-functional. Like many acceptance testing tools – and unlike xUnit tools – Storyteller tries to run a specification from start to finish, even if an early step fails. This behavior is valuable when you have an expensive scenario setup feeding multiple assertions, because it maximizes the feedback you get when you’re attempting to fix the failures. Unfortunately, this behavior also killed us with runaway CI builds.

The canonical example my colleagues told me about was navigating a browser to a new Url with WebDriver, having the navigation fail with some kind of “YSOD,” and watching Storyteller still try to wait until certain elements were visible, with retries then piled on top of the mess.

To alleviate this kind of pain, we invested a lot of time into making Storyteller 3 “fail fast” in its CI runs. Now, if Storyteller detects a “StorytellerCriticalException” or a “StorytellerCatastrophicException” (the entire system is unresponsive), Storyteller 3 will immediately stop the specification execution, bypass any possible retries, and return the results so far. Underneath the covers, we made Storyteller treat any error in Fixture setup or teardown as a critical exception.

“Catastrophic” exceptions are caused by any error in trying to bootstrap the application or in system-wide setup or teardown. In this case, Storyteller 3 stops all execution and reports the results with the catastrophic exception message. Based on your own environment tests, you can also force a catastrophic exception that effectively slams the brakes on the current batch run (for things like “can’t connect to the database at all”).

This small change in logic has done a lot to stop runaway CI builds when things go off the rails.

 

Why so slow?

The major driver for launching the Storyteller 3 rewrite was to try to make the automated testing builds on a very large project much faster. On top of all the optimization work inside of Storyteller itself, we also invested in adding the collection of performance metrics about test execution to try to understand what steps and system actions were really causing the testing slowness (early adopters of Storyteller 3 have consistently described the integrated performance data as their favorite feature).

While all that performance data is embedded in the HTML results, you can also have it dumped into CSV files for easy import into tools like Excel or Access, or exported in Storyteller’s own JSON format.

By analyzing the raw performance data with simple Access reports, I was able to spot some of the performance hot spots in our large application, like particularly slow HTTP endpoints, a browser application that was probably too chatty with its backend, and pages that were slow to load. I can’t say that we have all the performance issues solved yet, but now we’re much more informed about the underlying problems.

 

Optimizing for Batch Execution

With Storyteller 3 I was trying to incorporate every possible trick we could think of to squeeze more throughput out of the big CI builds. While we don’t completely support parallelization of specification runs yet (but we will sooner or later), Storyteller 3 partially parallelizes the batch runs by using a cascading series of producer/consumer queues to:

  1. Read in specification data
  2. “Plan” the specification by doing all necessary data coercion and attaching the raw spec inputs to the objects that will execute each step. Basically, do everything that can possibly be done before actually executing the specification.
  3. Execute specifications one at a time

The strategy above can help quite a bit if you need to run a large number of small specifications, but doesn’t help much at all if you have a handful of very slow specification executions.
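The cascading producer/consumer idea above can be sketched with plain BlockingCollection stages. This is a conceptual illustration of the pipeline shape, not Storyteller's actual code:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// A conceptual sketch of the cascading producer/consumer pipeline described
// above, built on BlockingCollection -- not Storyteller's real implementation.
public class SpecPipeline
{
    private readonly BlockingCollection<string> _toPlan = new BlockingCollection<string>();
    private readonly BlockingCollection<string> _toExecute = new BlockingCollection<string>();

    public async Task<int> Run(string[] specFiles)
    {
        // Stage 1: read in the raw specification data
        var reader = Task.Run(() =>
        {
            foreach (var file in specFiles) _toPlan.Add(file);
            _toPlan.CompleteAdding();
        });

        // Stage 2: "plan" each spec (data coercion, binding steps to objects)
        // while other specs are still being read or executed
        var planner = Task.Run(() =>
        {
            foreach (var spec in _toPlan.GetConsumingEnumerable())
                _toExecute.Add(spec + ":planned");
            _toExecute.CompleteAdding();
        });

        // Stage 3: execute the planned specifications one at a time
        var executed = 0;
        var executor = Task.Run(() =>
        {
            foreach (var spec in _toExecute.GetConsumingEnumerable())
                executed++;
        });

        await Task.WhenAll(reader, planner, executor);
        return executed;
    }
}
```

Because the reading and planning stages run ahead of the single-threaded executor, the executor rarely has to wait on anything but the specification itself, which is exactly where the "many small specs" win comes from.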