Thoughts on Agile Database Development

I’m flying out to our main office next week and one of the big things on my agenda is talking over our practices around databases in our software projects. This blog post is just me getting my thoughts and talking points together beforehand. There are two general themes here, how I’d do things in a perfect world and how to make things better within the constraints of the organization and software architecture that have now.

I’ve been a big proponent of Agile development processes and practices going back to the early days of Extreme Programming (before Scrum came along and ruined everything about the way that Scrappy ruined Scooby Doo cartoons for me as a child). If I’m working in an Agile way, I want:

  1. Strong project and testing automation as feedback cycles that run against all changes to the system
  2. Some kind of easy traceability from a built or deployed system to exactly the version of the code and its dependencies , preferably automated through your source control processes
  3. Technologies, tools, and frameworks that provide high reversibility to ease the cost of doing evolutionary software design.

From the get go, relational databases have been one of the biggest challenges in the usage of Agile software practices. They’re laborious to use in automated testing, often expensive in time or money to install or deploy, the change management is a bit harder because you can’t just replace the existing database objects the way we can with other code, and I absolutely think it’s reduces reversibility in your system architecture compared to other options. That being said, there are some practices and processes I think you should adopt so that your Agile development process doesn’t crash and burn when a relational database is involved.

Keep Business Logic out of the Database, Period.

I’m strongly against having any business logic tightly coupled to the underlying database, but not everyone feels the same way. For one reason, stored procedure languages (tSQL, PL/SQL, etc.) are very limited in their constructs and tooling compared to the languages we use in our application code (basically anything else). Mostly though, I avoid coupling business logic to the database because having to test through the database is almost inevitably more expensive both in developer effort and test run times than it would be otherwise.

Some folks will suggest that you might want to change out your database later, but to be honest, the only time I’ve ever done that in real life is when we moved from RavenDb to Marten where it had little impact on the existing structure of the code.

In practice this means that I try to:

  1. Eschew usage of stored procedures. Yes, I think there are still some valid reasons to use sprocs, but I think that they are a “guilty until proven innocent” choice in almost any scenario
  2. Pull business logic away from the database persistence altogether whenever possible. I think I’ll be going back over some of my old designing for testability blog posts from the Codebetter/ALT.Net days to try to explain to our teams that “wrap the database in an interface and mock it” isn’t always the best solution in every case for testability
  3. Favor persistence tools that invert the control between the business logic and the database over tooling like Active Record that creates a tight coupling to the database. What this means is that instead of having business logic code directly reading and writing to the database, something else (Dapper if we can, EF if we absolutely have to) is responsible for loading and persisting application state back and forth between the domain in code and the underlying database. The point is to be able to completely test your business logic in complete isolation from the database.

I would make exceptions for use cases where using the database engine to do set based logic in a stored procedure is a more efficient way to solve the problem, but I haven’t been involved in systems like that for a long time.

 

Database per Developer/Tester/Environment

My very strong preference and recommendation is to have each developer, tester, and automated testing environment using a completely separate database. The key reason is to isolate each thread of team activity to avoid simultaneous operations or database changes from interfering with each other. Sharing the database makes automated testing much less effective because you often get false negatives or false positives from database activity going on somewhere else at the same time — and yes, this really does happen and I’ve got the scars to prove it.

Additionally, it’s really important for automated testing to be able to tightly control the inputs to a test. While there are some techniques you can use to do this in a shared database (multi-tenancy usage, randomized data), it’s far easier mechanically to just have an isolated database that you can easily control.

Lastly, I really like being able to look through the state of the database after a failed test. That’s certainly possible with a shared database, but it’s much easier in my opinion to look through an isolated database where it’s much more obvious how your code and tests changed the database state.

I should say that I’m concerned here with logical separation between different threads of activity. If you do that with truly separate databases or separate schemas in the same database, it serves the same goal.

“The” Database vs. Application Persistence

There are two basic development paradigms to how we think about databases as part of a software system:

  1. The database is the system and any other code is just a conduit to get data back and forth from the database and  its consumers
  2. The database is merely the state persistence subsystem of the application

I strongly prefer and recommend the 2nd way of looking at that, and act accordingly. That’s a admittedly a major shift in thinking from traditional software development or database centric teams.

In practice, this generally means that I very strongly favor the concept of an application database that is only accessed by one application and can be considered to be just part of the application. In this case, I would opt to have all of the database DDL scripts and migrations in the source control repository for the application. This has a lot of benefits for development teams:

  1. It makes it dirt simple to correlate the database schema changes to the rest of the application code because they’re all versioned together
  2. Automated testing is easier within continuous integration builds becomes easier because you know exactly what scripts to apply to the database before running the tests
  3. No need for elaborate cascading builds in your continuous integration setup because it’s just all together

In contrast, a shared database that’s accessed by multiple applications is a lot more potential friction. The version tracking between the two moving parts is harder to understand and it harms your ability to do effective automated testing. Moreover, it’s wretchedly nasty to allow lots of different applications to float on top of the same database in what I call the “pond scum anti-pattern” because it inevitably causes nasty coupling issues that will almost result in regression bugs due to it being so much harder to understand how changes in the database will ripple out to the applications sharing the database. A much, much younger version of myself walked into a meeting and asked our “operational data store” folks to add a column to a single view and got screamed at for 30 minutes straight on why that was going to be impossible and do you know how much work it’s going to be to test everything that uses that view young man?

Assuming that you absolutely have to continue to use a shared database like my shop does, I’d at least try to ameliorate that by:

  • Make damn sure that all changes to that shared database schema are captured in source control somewhere so that you have a chance at effective change tracking
  • Having a continuous integration build for the shared database that runs some level of regression tests and then subsequently cascades to all of the applications that touch that database being automatically updated and tested against the latest version of the shared database. I’m expecting some screaming when I recommend that in the office next week;-)
  • At the least, have some mechanism for standing up a local copy of the up to date database schema with any necessary baseline data on demand for isolated testing
  • Some way to know when I’m running or testing the dependent applications exactly what version of the database schema repository I’m currently using. Git submodules? Distribute the DB via Nuget? Finally do something useful with Docker, distribute the DB as a versioned Docker image, and brag about that to any developer we meet?

The key here is that I want automated builds constantly running as feedback mechanisms to know when and what database changes potentially break (or fix too!) one of our applications. Because of some bad experiences in the past, I’m hesitant to use cascading builds between separate repositories, but it’s definitely warranted in this case until we can get the big central database split up.

At the end of the day, I still think that the shared database architecture is a huge anti-pattern that most shops should try to avoid and I’d certainly like to see us start moving away from that model more and more.

 

Document Databases over Relational Databases

I’ve definitely put my money where my mouth is on this (RavenDb early on, and now Marten). In my mind, evolutionary or incremental software design is much easier with document databases for a couple reasons:

  • Far fewer changes in the application code result in database schema changes
  • It’s much less work to keep the application and database in sync because the storage just reflects the application model
  • Less work in the application code to transform the database storage to structures that are more appropriate for the business logic. I.e., relational databases really aren’t great when your domain model is logically hierarchical rather than flat
  • It’s a lot less work to tear down and set up known test input states in document databases. With a relational database you frequently end up having to deal with extraneous data you don’t really care about just to satisfy relational integrity concerns. Likewise, tearing down relational database state takes more care and thought than it does with a document database.

I would still opt to use a relational database for reporting or if there’s a lot of set based logic in your application. For simpler CRUD applications, I think you’re fine with just about any model and I don’t object to relational databases in those cases either.

It sounds trivial, but it does help tremendously if your relational database tables are configured to use cascading deletes when you’re trying to set a database into a known state for tests.

Team Organization

My strong preference is to have a completely self-contained team that has the ability and authority to make any and all changes to their application database, and that’s most definitely been valid in my experience. Have the database managed and owned separately from the development team is a frequent source of friction and definitely a major hit to your reversibility that forces you to do more potentially wrong, upfront design work. It’s much worse when that separate team does not share your priorities or simply works on a very different release schedule. I think it’s far better for a team to own their database — or at the very worst, have someone who is allowed to touch the database in the team room and team standup’s.

If I had full control over an organization, I would not have a separate database team. Keeping developers and database folks on separate team makes your team have to spend more time on inter-team coordination, takes away from the team’s flexibility in deciding what they can deliver, and almost inevitably causes a bottleneck constraint for projects. Even worse in my mind is when neither the developers nor the database team really understand how their work impacts the other team.

Even if we say that we have a matrix organization, I want the project teams to have primacy over functional teams. To go farther, I’d opt to make functional teams (developers, testers, DBA’s) be virtual teams solely for the purpose of skill acquisition, knowledge sharing, and career growth. My early work experience was being an engineer within large petrochemical project teams, and the project team dominant matrix organization worked a helluva lot better than it did at my next job in enterprise IT that focused more on functional teams.

As an architect now rather than a front line programmer, I constantly worry about not being able to feel the “pain” that my decisions and shared libraries cause developers because that pain is an important feedback mechanism to improve the usability of our shared infrastructure or application architecture. Likewise, I worry that having a separate database team creates a situation where they’re not very aware of the impact of their decisions on developers or vice versa. One of the very important lessons I was taught as an engineer was that it was very important to understand how other engineering disciplines work and what they needed so that we could work better with them.

Now though, I do work in a shop that has historically centralized the control of the database in a centralized database team. To mitigate the problems that naturally arise from this organizational model, we’re trying to have much more bilateral conversations with that team. If we can get away with this, I’d really like to see members of that team spend more time in the project team rooms. I’d also love it if we could steal a page from my original engineering job (Bechtel)  and suggest some temporary rotations between the database and developer teams to better appreciate how the other half of that relationship works and what their needs are.

 

 

 

18 thoughts on “Thoughts on Agile Database Development

  1. Great post as always Jeremy.

    I agree with all your points re. keeping logic out of the database, a database per developer and also per environment etc.
    I’ve only ever really used relational databases, keeping the database DDL & per-sprint migration scripts under source control, so schema migration and data seeding is quite easy and repeatable in every environment using tools like DbUp, RedGate Sql Source Control (or a variety of home grown solutions to achieve same). Having your database and migrations under source control is imperative to CI IMO.

    However, it’s still unclear to me how to achieve schema migration and data seeding in a document db as despite it’s flaws, sql is great for DDL and data seeding. Googling hasn’t revealed any similar tools in the document db space – perhaps I’m thinking too relationally about it? Most posts I’ve seen relating say this should be done in application code, but this sounds like your app code would become very polluted.

    Wondering if you can you give any insight on what a document db equivalent to would be to a sql migration script that adds a new column with a default or creates a new table seeded with a bunch of reference data?

      1. I strongly agree with that sentiment as well. I tried changing my job for several months, and then changed my high instead. Kind of sad that in a market with ridiculous demand for developers that I wind up feeling like I need to cling to my current one for dear life, but experiences like that last are just too common.

  2. I think biggest obstacle you often run into with this sort of design is security. I’ve encountered companies, where security is the biggest concern in the design of the application and it can override many of architectural goals you might like to have for an application. e.g they can insist on employing security in the database that disallows direct access to tables other than through stored procs. They’re basically against any kind of access to data that allows dynamic querying and this drives a lot of the business logic into the DB. When dealing with personal data of clients and the reputation of large international software companies, I think these concerns are fair enough.

    Another big piece that is missing on this topic, is code generation. I’m not sure how or why tools like Code Smith and My Generation went out of flavour (about the time entity framework came into existance). Strangely enough they went out of flavour round the time Microsoft introduced constructs into .NET such as partial classes and partial methods that make extensibility of code generation code very easy.

    You can do a lot with code generation, to save yourself from worrying about differences between your application and database models. It can help bridge the gap between application domains, which seemed be your primary motivation for using Document Databases. Do you think there is much of a trade off in performance between using Document and Relational databases? Obviously I’m guessing the answer depends on the application.

    1. Oh, sorry, you’ve got to count me in the crowd of folks who despised the old “let’s codegen our application from a database schema!” tools. My take on code generation like that is at best, it’s a stand in solution until you figure out a less crappy way to get things done.

      1. Whilst applications and databases are separate domains there’s a need to bridge them.

        I’d love to hear a coherent argument about why creating you data model in your application and then replicating it in a database using ORM is any better than creating your data model in a database and using a code gen tool to replicate that model in your application. Why is code gen crappy? Doesn’t it solve the exact same problem?

        The other option is that you keep two identical data models, one in a database, the other in your application in sync by hand. To me that’s the worst of all three options, as humans pretty crappy at keeping two different things identical.

        Or is there another option I’ve somehow missed?

  3. @Mick,

    If you followed me you’d know that I’m very down on the usage of ORM’s, at least heavyweight ones. My strong preference is to use a document db approach where there’s no mapping necessary between the real application model and what the db stores.

    Even so, I’d still choose an ORM over the codegen so that I can focus on writing the application code and worry about persistence later. I think the domain model backed by an ORM gives you more reversibility than the codegen approach and much more flexibility in structuring your domain model. Kinda feel like this is a very old argument that was mostly settled years ago. I’m genuinely surprised to see anyone still wanting to use tools like Subsonic or LLBLGen.

    The codegen approach creates an excruciatingly tight coupling between the application and the database that is very harmful IMO. Maybe for CRUD applications, but even then, I’d rather use something like Dapper.

    1. Well I’ve been genuinely surprised for some time that no one uses them, especially in preference to T2 code generation (which definitely is crappy). We define the data model using source controlled VS DB projects,and use code gen to partially generate the entities, interfaces, and repositories against the database.

      I prefer code gen as I think a database is the best place to define the data model, plus with a decent code gen tool you can access the whole schema, foreign keys, indexes, constraints etc and create code gen scripts that generate methods on repositories which automate the validation of data, maximize the usage of indexes and the SQL Server plan cache.

      How tightly coupled code is to the database I think depends on the design of the system and the code that is generated. The code we generate adheres to SOLID principles, plus there’s layers on top of it. I wouldn’t consider the system we’ve created tightly coupled to the database, perhaps to a data model, but not to a specific implementation of that data model.

      I hadn’t heard of Dapper till reading this article. Thanks, it definitely looks worth investigating (the rest of your article is pretty interesting too).

      I think when it comes to using ORMs especially linq and EF its very easy to produce bad code, i.e. executing queries that don’t utilize indexes, or produce SQL that doesn’t maximise the use of the DB plan cache. But its not impossible to produce code in EF that doesn’t do this.

  4. Thanks Jeremy for the great article on an interesting topic.

    The approach to agile database application development I favour is almost completely opposite to yours, for example I use stored procedures and table functions heavily as an API to the DB; and use code generation tools frequently – albeit old home-made ones I tweak with every new project.

    I’m sure we agree the key to doing a great job, one that covers off concerns such as security; performance; testabilty; maximum simplicity and manageability; TCO; and mercy for the users and technical maintainers of the future, is to reflect upon how you deliver better and better software each time you start a new project.

    You do make a compelling case for your approach and it is a good approach, yet equivalent cases exist for radically different alternatives.

    Whatever considered path you take, it will have a lesser impact on project goals than the less-technical aspects of working in software development, such as the gulf between the requirements relayed to the team and what the client actually requires – or poor team relationships. It was demoralising to read the comment from the codevergnugen about being made a pariah for advocating the ideas you’ve presented, that’s the evidence of thoughtfulness about quality that IT managers should value highly and let their teams run with regardless of their own views about how things are “best” done.

  5. This is a phenomenal read! I’m glad someone took the time to lay out the cases for each point – too often I run into colleagues who are brainwashed to “always use stored procedures”, so this may help me to converse about better alternatives for building higher quality, testable systems. I find your points about the development paradigms a bit depressing, as I’m a resounding ‘2’ but surrounded by career ‘1’s. On our team, DB change scripts are capture via cutting edge tools like notepad and Windows 7 sticky notes. Sproc breaks happen often and usually have significant severity (in all environments). Database schema meetings involve more time than focusing on the solution the business actually needed! Domain thinking allows me to craft extensible business rules that are easy to change, not break, and I’d have to strongly agree with your philosophy on development.

    Keep up the great work! Cheers!

  6. Different application types require different design patterns. “Keep Business Logic out of the Database, Period.” is just too ignorant (am I am being polite here) at least for highly transactional types of applications.

    1. Shrug. I’ve seen plenty of high volume systems that successfully kept business logic out of the database. Arguably, the db is generally your scalability bottleneck, so I’d say it’s even more important to keep unnecessary work out of the database itself that could be somewhere else.

      1. Code written by developers who don’t understand how the database works is the main scalability bottleneck in the most cases, not the database itself. I totally agree with you that some work must be done in the database and other that must be done somewhere else. This is why we have OLTP, OLAP and Big Data systems for example, each designed for specific types of jobs. Get them mixed up is a straight way to failure.

Leave a comment