tl;dr This post is an attempt to codify my thoughts about how to succeed with end to end integration testing. A toned down version of this post is part of the Storyteller 3 documentation.
About six months ago the development teams at my shop came together in kind of a town hall to talk about the current state of our automated integration testing approach. We have a pretty deep investment in test automation and I think we can claim some significant success, but we also have had some problems with test instability, brittleness, performance, and the time it takes to author new tests or debug existing tests that have failed.
Some of the problems have since been ameliorated by tightening up on our practices — but that still left quite a bit of technical friction and that’s where this post comes in. Since that meeting, I’ve been essentially rewriting our old Storyteller testing tool in an attempt to address many of the technical issues in our automated testing. As part of the rollout of the new Storyteller 3 to our ecosystem, I thought it was worth a post on how I think teams can be more successful at automated end to end testing.
I’ve worked in far too many environments and codebases where the automated tests were “flakey” or unreliable:
- Teams that do all of their development against a single shared, development database such that the data setup is hard to control
- Web applications with a lot of asynchronous behavior are notoriously hard to test and the tests can be flakey with timing issues — even with all the “wait for this condition on the page to be true” discipline in the world.
- Distributed architectures can be difficult to test because you may need to control, coordinate, or observe multiple processes at one time.
- Deployment issues or technologies that tend to hang on to file locks, tie up ports, or generally lock up resources that your automated tests need to use
To be effective, automated tests have to be reliable and repeatable. Otherwise, you’re either going to spend all your time trying to discern if a test failure is “real” or not, or you’re most likely going to completely ignore your automated tests altogether as you lose faith in them.
I think you have several strategies to try to make your automated, end to end tests more reliable:
- Favor white box testing over black box testing (more on this below)
- Closely related to #1, replace hard-to-control infrastructure dependencies with stub services, even in functional testing. I know some folks absolutely hate this idea, but my shop is having a lot of success in using an IoC tool to swap out dependencies on external databases or web services that are completely out of our control, even in functional testing.
- Isolate infrastructure to the test harness. For example, if your system accesses a relational database, use an isolated schema for the testing that is only used by the test harness. Shared databases can be one of the worst impediments to successful test automation. It’s both important to be able to set up known state in your tests and to not get “false” failures because some other process happened to alter the state of your system while the test is running. Did I mention that I think shared databases are a bad idea yet?*
- Completely control system state setup in your tests or whatever build automation you have to deploy the system in testing.
- Collapse a distributed application down to a single process for automated functional testing rather than try to run the test harness in a different process than the application. In our functional tests, we will run the test harness, an embedded web server, and even an embedded database in the same process. For distributed applications, we have been using additional .Net AppDomains to load related services and using some infrastructure in our OSS projects to coordinate the setup, teardown, and even activity in these services during testing time.
- As a last resort for a test that is vulnerable to timing issues and race conditions, allow the test runner to retry the test
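To illustrate that last-resort retry strategy, here is a minimal sketch of a retry wrapper for a timing-sensitive test. It's in Python for brevity (the systems in this post are .NET), and `retry_flaky` is a hypothetical helper, not part of any real test framework:

```python
import functools
import time

def retry_flaky(attempts=3, delay=0.0):
    """Re-run a timing-sensitive test a limited number of times
    before declaring failure. Hypothetical last-resort helper."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(attempts):
                try:
                    return test_fn(*args, **kwargs)
                except AssertionError as e:
                    last_error = e      # remember the failure
                    time.sleep(delay)   # brief pause before retrying
            raise last_error            # still failing: a real failure
        return wrapper
    return decorator
```

Capping the attempts matters: if the test still fails after a few tries you want a genuine red build, not an infinite loop quietly masking a real regression.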
Failing all of those things, if a test is so unstable and unreliable that it renders your automated build useless, I definitely think you should just delete that test. A reliable test suite with less coverage is more useful to a team than a more expansive test suite that is not reliable.
You Gotta Have Continuous Integration
This section isn’t the kind of pound-on-the-table, Uncle Bob-style “you must do this or you’re incompetent” rant that causes the Rob Conerys of the world to have conniptions. Large scale automated testing simply does not work if the automated tests are not running regularly as the system continues to evolve.
Automated tests that are never or seldom executed can even be a burden on a development team that still tries to keep that test code up to date with architectural changes. Even worse, automated tests that are not constantly executed are not trustworthy, because you no longer know whether test failures are real or just a result of the application structure changing.
Assuming that your automated tests are legitimately detecting regression problems, you need to determine what recent change introduced the problem — and it’s far easier to do that if you have a smaller list of possible changes and those changes are still fresh in the developer’s mind. If you are only occasionally running those automated tests, diagnosing failing tests can be a lot like finding the proverbial needle in the haystack.
I strongly prefer to have all of the automated tests running as part of a team’s continuous integration (CI) strategy — even the heavier, slower end to end kind of tests. If the test suite gets too slow (we have a suite that’s currently taking 40+ minutes), I like the “fast tests, slow tests” strategy of keeping one main build that executes the quicker tests (usually just unit tests) to give the team reasonable confidence that things are okay. The slower tests would be executed in a cascading build triggered whenever the main build completes successfully. Ideally, you’d like to have all the automated tests running against every push to source control, but even running the slower test suites in a nightly or weekly scheduled build is better than nothing.
Make the Tests Easy to Run Locally
I think the section title is self-explanatory, but I’ve gotten this very wrong in the past in my own work. Ideally, you would have a task in your build script (I still prefer Rake, but substitute MSBuild, Fake, Make, Gulp, NAnt, whatever you like) that completely sets up the system under test on your machine and runs whatever test harness you use. In a less perfect world, a developer has to jump through hoops to find hidden dependencies and take several poorly described steps in order to run the automated tests. I think this issue is much less problematic than it was earlier in my career as we’ve adopted much more project build automation and moved to technologies that are easier to automate in deployment. I haven’t gotten to use container technologies like Docker myself yet, but I sure hope that those tools will make doing the environment setup for automating tests easier in the future.
Whitebox vs. Blackbox Testing
I strongly believe that teams should generally invest much more time and effort into whitebox tests than blackbox tests. Throughout my career, I have found that whitebox tests are frequently more effective in finding problems in your system – especially for functional testing – because they tend to be much more focused in scope and are usually much faster to execute than the corresponding black box test. White box tests can also be much easier to write because there’s simply far less technical stuff (databases, external web services, service buses, you name it) to configure or set up.
I do believe that there is value in having some blackbox tests, but I think that these blackbox tests should be focused on finding problems in technical integrations and infrastructure whereas the whitebox tests should be used to verify the desired functionality.
Especially at the beginning of my career, I frequently worked with software testers and developers who just did not believe that any test was truly useful unless the testing deployment was exactly the same as production. I think that attitude is inefficient. My philosophy is that you write automated tests to find and remove problems from your system, but not to prove that the system is perfect. Adopting that philosophy, favoring white box over black box testing makes much more sense.
Choose the Quickest, Useful Feedback Mechanism
Automating tests against a user interface has to be one of the most difficult and complex undertakings in all of software development. While teams have been successful with test automation using tools like WebDriver, I very strongly recommend that you do not test business logic and rules through your UI if you don’t have to. For that matter, try hard to test business logic without touching the database. What does this mean? For example:
- Test complex logic by calling into a service layer instead of the UI. That’s a big issue for one of the teams I work with who really needs to replace a subsystem behind http json services without necessarily changing the user interface that consumes those services. Today the only integration testing involving that subsystem is done completely end to end against the full stack. We have plenty of unit test coverage on the internals of that subsystem, but I’m pretty certain that those unit tests are too coupled to the implementation to be useful as regression or characterization tests when that team tries to improve or replace that subsystem. I’m strongly recommending that that team write a new suite of tests against the gateway facade service to that subsystem for faster feedback than the end to end tests could ever possibly be.
- Use Subcutaneous Tests even to test some UI behavior if your application architecture supports that
- Make HTTP calls directly against the endpoints in a web application instead of trying to automate the browser if that can be useful to test out the backend.
- Consider testing user interface behavior with tightly controlled stub services instead of the real backend
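To make the first bullet concrete, here is a toy sketch of verifying a business rule by calling into the service layer directly instead of driving a browser. It's Python for brevity (the post's systems are .NET), and the pricing rule and names are invented for illustration:

```python
# Hypothetical service-layer function: the business rule we want to verify.
def quote_premium(age, accidents):
    """Toy insurance pricing rule standing in for real domain logic."""
    base = 500
    if age < 25:
        base += 300          # young-driver surcharge
    return base + 150 * accidents

def test_young_driver_surcharge():
    # Exercise the rule through the service layer -- no browser,
    # no HTTP, no database -- for the quickest useful feedback.
    assert quote_premium(age=22, accidents=1) == 950
    assert quote_premium(age=40, accidents=0) == 500
```

The same rule tested through the UI would need a deployed site, a browser driver, and wait conditions; tested at this level it runs in microseconds and fails with an exact assertion.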
The general rule we encourage in test automation is to use the “quickest feedback cycle that tells you something useful about your code” — and user interface testing can easily be much slower and more brittle than other types of automated testing. Remember too that we’re trying to find problems in our system with our tests instead of trying to prove that the system is perfect.
Setting up State in Automated Tests
I wrote a lot about this topic a couple years ago in My Opinions on Data Setup for Functional Tests, and I don’t have anything new to say since then. ;) To sum it up:
- Use self-contained tests that set up all the state that a test needs.
- Be very cautious using shared test data
- Use the application services to set up state rather than some kind of “shadow data access” layer
- Don’t couple test data setup to implementation details. I.e., I’d really rather not see gobs of SQL statements in my automated test code
- Try to make the test data setup declarative and as terse as possible
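A minimal sketch of the declarative, self-contained setup these bullets describe, using a hypothetical “test data builder” (Python for brevity; the column names are invented):

```python
# Hypothetical test data builder: tests state only the columns they care
# about, and the builder fills in everything else, so the tests are not
# coupled to the full table schema or to raw SQL.
DEFAULTS = {"name": "anonymous", "state": "TX", "active": True}

def build_customer(**overrides):
    """Merge test-specified values over sensible defaults."""
    row = dict(DEFAULTS)
    row.update(overrides)
    return row

def insert_customers(db, rows):
    # In a real harness this would go through the application's own
    # persistence services; here `db` is a list standing in for the database.
    for overrides in rows:
        db.append(build_customer(**overrides))
```

A test that only cares about a customer’s name writes `build_customer(name="Hank")` and stays insulated when the schema grows a new required column — only the builder’s defaults change, not every test.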
Test Automation has to be a factor in Architecture
I once had an interview with a company that makes development tools. I knew going in that their product had some serious deficiencies in its automated testing story. When I told my interviewer that I was confident I could help that company make their automated testing support much better, I was told that testing was just a “process issue.” Last I knew, that product was still weak in its support for automating tests against systems that use it.
Automated testing is not merely a “process issue,” but should be a first class citizen in selecting technologies and shaping your system architecture. I feel like my shop is far above average for our test automation, and that is in no small part because we have purposely architected our applications in such a way as to make functional, automated testing easier. The work I described in sections above to collapse a distributed system into one process for easier testing, using a compositional architecture effectively composed by an IoC tool, and isolating business rules from the database in our systems has been vital to what success we have had with automated testing. In other places we have purposely added logging infrastructure or hooks in our application code for no other reason than to make it easier for test automation infrastructure to observe or control the application.
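The compositional-architecture point can be sketched like this (Python for brevity; `StubRateGateway` and `PricingService` are invented names): because the external dependency is injected rather than hard-wired, a functional test can substitute a tightly controlled stub for a third-party service that is out of our control:

```python
class StubRateGateway:
    """Stands in for a third-party currency service during tests."""
    def __init__(self, canned_rate):
        self.canned_rate = canned_rate

    def current_rate(self, currency):
        # Deterministic answer: no network, no flakiness.
        return self.canned_rate

class PricingService:
    def __init__(self, rate_gateway):
        # Injected dependency: the real gateway in production,
        # a stub in functional tests (an IoC tool does this wiring in .NET).
        self.rates = rate_gateway

    def price_in(self, currency, amount_usd):
        return amount_usd * self.rates.current_rate(currency)
```

A test then builds `PricingService(StubRateGateway(2.0))` and asserts on the pricing behavior with complete control over what the “external service” returns.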
Other Stuff for later…
I don’t think that in 10 years of blogging I’ve ever finished a blog series, but I might get around to blogging about how we coordinate multiple services in distributed messaging architectures during automated tests or how we’re integrating much more diagnostics in our automated functional tests to spot and prevent performance problems from creeping into the application.
* There are some strategies to use in testing if you absolutely have no other choice in using a shared database, but I’m not a fan. The one approach that I want to pursue in the future is utilizing multi-tenancy data access designs to create a fake tenant on each test run to keep the data isolated for the test even if the damn database is shared. I’d still rather smack the DBA types around until they get their project automation act together so we could all get isolated databases.
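The fake-tenant idea in that footnote might look something like this sketch (Python for brevity; all names are hypothetical): each test run mints its own tenant id, so even on a shared database, reads and writes from concurrent runs never collide:

```python
import uuid

def new_test_tenant():
    """Mint a unique tenant id per test run so test data stays
    isolated even when the database itself is shared."""
    return f"test-{uuid.uuid4().hex[:8]}"

def scoped(query_rows, tenant_id):
    # Every read in the test harness filters by the run's own tenant,
    # so other runs (or other teams) can't pollute the results.
    return [row for row in query_rows if row["tenant"] == tenant_id]
```

In a real multi-tenant system the tenant filter would live in the data access layer itself; the harness would only need to create the tenant at startup and (optionally) purge it at teardown.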
20 thoughts on “Succeeding with Automated Integration Tests”
Great read — thank you!
Could you clarify this part though?
“Don’t couple test data setup to implementation details. I.e., I’d really rather not see gobs of SQL statements in my automated test code.”
I understand the point you are making, but I don’t understand the *how* of it — for example, how do I test my DAO-layer without using SQL to setup the associated AFT?
Or, are you really saying to not test the DAO-layer, to just start testing at the Domain-layer?
That’s probably a blog post by itself. Sure, *something* has to end up generating the Sql or writing to files or whatever state exists, but I usually don’t want that raw SQL in the test itself. I like to use something conceptually similar to this: http://www.natpryce.com/articles/000714.html.
I prefer to have some kind of intermediate (hopefully very declarative) language you use in the test expression itself that then delegates to the details of the database. For us, we use Storyteller for most of our integration and acceptance tests, and the data setup generally ends up being declarative tables of data. It’s extra work to do that, but your test data setup can:
* Fill in default values that the test does not care about but is necessary for the database constraints
* Decouple your test somewhat from database changes. Say that a database table gets a new NOT NULL column. Instead of updating all the tests that use that table, maybe you can just have the “Test Data Builder” code fill in the new column instead of changing SQL everywhere.
Oooh… I see what you are saying now. Good info, thank you for the clarification. 🙂
Good stuff, as usual.
I’m still waiting impatiently for the big DB vendors to offer true in-memory options with most of the features intact. Then, fast test setup/teardown against the actual DB could be a real possibility.
Jeremy, good stuff as always. I am putting together an Automated Testing presentation for my organization and this is one of several very useful resources I am using — practical views as always. Nine years on from attending that first Alt.Net conference in Austin, I am satisfied to say my dev team creates and maintains a well-covered code base, lives and breathes SOLID (though it still struggles with the deep pattern insight that should flow from those principles), and understands well the balance of coupling and cohesion.
Anyway, as I preside over a service/messaging architecture with many (often stepwise-execution) services, I am keen to understand just a little bit (even just a 3000 ft conceptual telling) of “how we coordinate multiple services in distributed messaging architectures during automated tests”. It doesn’t need to be extensive, just some key pointers. We did try to accomplish this with NUnit as the harness (running each service in a unit test) but ran into a problem with that framework’s (and others’) lack of support for test ordering. Of course the tool is intended to support unit and integration testing, so it is natural that the first thing we hear when we bring up ordered tests is “you’re not using the tool correctly”, which of course is true. We DO use the tool correctly for the 8000+ unit and integration tests we maintain (they run parallel and independent), but we’d like to find a low-friction approach to stitching together (say) 6 services to run sequentially as an acceptance/regression test of our long running processes.
Any pointers or admonitions would be helpful 🙂
Sorry I missed this — and man, it’s been awhile.
The problem here is that all the stuff we did to make automated testing work on our messaging bus is tied to elements of FubuMVC, but you can read up on what we do here: