To be as clear as possible, I’m speaking only for myself and any views or opinions expressed here do not represent my employer whatsoever.
Years ago I wrote a post called The Surprisingly Valuable and Lasting Lessons I Learned from a Horrible Project that’s exactly what it sounds like. I was on a genuinely awful project rife with Dilbertesque elements. As usual though, horrible experiences can lead to some very valuable lessons. From that terrible project, I learned plenty about team communication, Agile processes, and doing software design within an Agile process (XP in this case).
If you’ve followed my Twitter feed over the past couple of years, you know that I have routinely expressed frustration working within a long-running waterfall project. I finally rolled off that project this past Friday after 2+ years, and I have some thoughts. I definitely appreciate some of the personal relationships that came out of the project, but I’m not leaving feeling very satisfied by how the project went overall. That makes it time to reflect and figure out what actionable lessons I can take from the project for future endeavors.
Because it’s an easy writing crutch, let’s go to the good, the bad, and the ugly of the project and what I (re)learned this time around:
The Good
Let’s start with some positives. I’ve been a big believer in capturing business rule requirements as “executable specifications” (acceptance test driven development, or some folks’ version of behavior driven development) for years. While we weren’t allowed by the client to use my preferred tooling (grumble) to do that and had to write some custom tooling, we had some genuine success with executable specifications as requirements. Think “if we have these exact inputs and run the business rules, these are the exact validation errors and/or transactions that should be triggered next.” Our client had some very involved business rules that were driven by even more complex databases, and having the client review and adjust our acceptance tests written as examples instead of just sticking with vaguely worded Word documents made a hugely positive difference for us.
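To make that concrete, here’s a minimal sketch of what one of those executable specifications could look like if it were expressed as a plain xUnit test. Every domain name here is made up for illustration; the real specifications went through our custom tooling so the client could read and adjust the examples themselves:

```csharp
using System.Collections.Generic;
using Xunit;

// Hypothetical domain types, purely for illustration
public class Account
{
    public string Id { get; set; }
    public decimal PastDueBalance { get; set; }
}

public class RuleResult
{
    public List<string> ValidationErrors { get; } = new List<string>();
    public List<string> TriggeredTransactions { get; } = new List<string>();
}

public class EligibilityRules
{
    public RuleResult Evaluate(Account account)
    {
        var result = new RuleResult();
        if (account.PastDueBalance > 0)
            result.ValidationErrors.Add($"Account {account.Id} has a past due balance");
        return result;
    }
}

public class EligibilityRuleSpecs
{
    [Fact]
    public void rejects_an_account_with_a_past_due_balance()
    {
        // "If we have these exact inputs..."
        var account = new Account { Id = "A-100", PastDueBalance = 250.00m };

        // "...and run the business rules..."
        var result = new EligibilityRules().Evaluate(account);

        // "...these are the exact validation errors and/or
        // transactions that should be triggered next."
        Assert.Contains("Account A-100 has a past due balance", result.ValidationErrors);
        Assert.Empty(result.TriggeredTransactions);
    }
}
```

The point is that the example itself is the requirement: the client signs off on concrete inputs and outcomes rather than prose.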
Automating all deployments through Azure DevOps was a big win, especially when we finally got folks outside of our organization to stop deploying directly from Visual Studio.Net. It’s vitally important to have good traceability from the binaries deployed in testing or production to the revision of the code in source control. I learned that way back when in my previous “worst project ever”, and we re-learned that in this project. The one aspect of this project that was an undeniable success was introducing continuous integration and some continuous delivery to our customer.
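As a small illustration of that traceability idea, one common .NET pattern is to have the pipeline stamp the source revision into the assembly’s informational version so any deployed binary can report exactly which commit it was built from. The class and version string below are hypothetical:

```csharp
using System.Reflection;

// If the build stamps the commit SHA into the binary, e.g.
//   dotnet build /p:InformationalVersion=1.4.0+$(Build.SourceVersion)
// then any running process can report exactly which revision it came from.
public static class BuildInfo
{
    public static string Version =>
        Assembly.GetEntryAssembly()
            ?.GetCustomAttribute<AssemblyInformationalVersionAttribute>()
            ?.InformationalVersion ?? "unknown";
}
```

Logging that value at startup, or exposing it from a diagnostics endpoint, makes “what exactly is running in QA right now?” a trivial question to answer.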
The Bad
Developers who do not communicate or interact well with other developers simply cannot be placed in important positions on integration projects, regardless of their technical ability or domain knowledge.
We started adding some environment tests (with an ancient, but still working, feature in StructureMap!) into our deployment pipelines about halfway through the project, but I wish we’d gone much farther. The overall technical ecosystem was not perfectly reliable and there were dependencies that still had manual deployments, so it became very important to have self-diagnosing deployments that could tell you quickly when things like database connectivity, configuration for external dependencies, or expected network shares were broken or unreachable.
I was on a call this morning for a new project just getting off the ground and wanted to give a giant high five through Zoom to our DevOps architect when he showed how he was planning to build environment tests directly into this new project’s CD pipeline.
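For illustration, a bare-bones version of that kind of self-diagnosing deployment could be as simple as a console app the CD pipeline runs immediately after a deployment, with a nonzero exit code failing the stage. The connection string variable and share path below are placeholders:

```csharp
using System;
using Microsoft.Data.SqlClient;

// Runs right after deployment; a nonzero exit code fails the
// pipeline stage before bad bits ever reach users.
public static class EnvironmentTests
{
    public static int Main()
    {
        var failures = 0;

        failures += Check("Database connectivity", () =>
        {
            using var conn = new SqlConnection(Environment.GetEnvironmentVariable("APP_DB"));
            conn.Open();
        });

        failures += Check("Expected network share", () =>
        {
            if (!System.IO.Directory.Exists(@"\\fileserver\drops"))
                throw new Exception("Share is unreachable");
        });

        return failures; // nonzero fails the deployment stage
    }

    private static int Check(string name, Action test)
    {
        try { test(); Console.WriteLine($"OK   {name}"); return 0; }
        catch (Exception e) { Console.WriteLine($"FAIL {name}: {e.Message}"); return 1; }
    }
}
```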
Conway’s Law is not evil per se, but it does exist, and you absolutely need to be aware of it when you lay out both your desired architecture and your organizational structure. We were absolutely screwed over by Conway’s Law on this project. Because the service boundaries fell along the lines of the organizational model rather than where they would have made sense from a technical or even problem domain perspective, the customer ended up with a system that has performance problems and is harder to troubleshoot and support than it needed to be.
I wish we’d engaged much earlier with the client’s operations team. It was waterfall after all, so they only expected to get support and troubleshooting documentation near the end of the project. That said, while I still think our basic architecture was appropriate from a technical perspective, it didn’t at all fit into the operations team’s comfort zone or the customer’s existing infrastructure for application support. Either we could have worked with them much earlier in the project to make them comfortable with alternative tooling for operations monitoring, or we could have bitten the bullet and made the systems act much more like the batch driven mainframe tools they were used to.
The error handling we designed into our asynchronous support was based heavily on Jasper’s existing error handling, which in turn grew out of my experiences at my previous company, where we dealt with large volumes and frequent transient errors. In this ecosystem, our problems were really more systemic: a downstream system would be either completely down or misconfigured so that every interaction with it failed. In that case, what we really needed was a circuit breaker strategy in the error handling inside the message handling code. The main lesson here is to be careful that you aren’t trying to fight the last war.
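To show the idea, here’s a minimal circuit breaker sketch with illustrative names (and ignoring thread safety for brevity). After enough consecutive failures the circuit “opens” and message handling fails fast instead of hammering a dependency that is completely down:

```csharp
using System;

// After N consecutive failures the circuit "opens" and callers fail
// fast; after a cooldown, one call is let through to probe for recovery.
public class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _cooldown;
    private int _consecutiveFailures;
    private DateTime _openedAt;

    public CircuitBreaker(int failureThreshold, TimeSpan cooldown)
    {
        _failureThreshold = failureThreshold;
        _cooldown = cooldown;
    }

    public void Execute(Action handler)
    {
        if (_consecutiveFailures >= _failureThreshold &&
            DateTime.UtcNow - _openedAt < _cooldown)
        {
            // Fail fast and leave the message queued for later instead
            // of burning retries against a dead dependency.
            throw new InvalidOperationException("Circuit is open");
        }

        try
        {
            handler();
            _consecutiveFailures = 0; // success closes the circuit
        }
        catch
        {
            _consecutiveFailures++;
            if (_consecutiveFailures >= _failureThreshold)
                _openedAt = DateTime.UtcNow; // open (or re-open) the circuit
            throw;
        }
    }
}
```

In the .NET world, Polly’s circuit breaker policies give you a hardened version of this off the shelf.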
I wish there had been a single architect responsible for all the elements of this large initiative (preferably me). Early on I only had a window into the specific elements my teams were building out, and I didn’t get to see the bigger picture until much later, assuming I ever really did. Everyone else was in the same boat, and I felt like we were all the proverbial blind men trying to describe an elephant by feel.
The Ugly
It’s hard to describe, but my single biggest regret on this project was not pushing much harder to create an effective automated testing strategy against our integration with an extremely problematic 3rd party system. In my opinion, having to depend so much on very slow, laborious manual testing against this 3rd party system at the center of everything was the bottleneck of all our efforts. Most of our technical risk and production issues have been related to this 3rd party system. A fully automated test suite might have allowed us to iterate much faster and find and remove the problems in the integration far earlier.
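One way to attack that problem, sketched here with hypothetical names, is to put the 3rd party system behind an interface the application owns and run the bulk of the automated suite against a deterministic fake, reserving the slow manual testing for a thin layer of true end-to-end checks:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// The application owns this abstraction over the problematic 3rd
// party system; production wires up the real client behind it.
public interface IPartnerGateway
{
    Task<string> SubmitAsync(string payload);
}

// A deterministic fake for fast automated tests: records what was
// sent and returns a canned, predictable response.
public class FakePartnerGateway : IPartnerGateway
{
    public List<string> Submitted { get; } = new List<string>();

    public Task<string> SubmitAsync(string payload)
    {
        Submitted.Add(payload); // captured for assertions
        return Task.FromResult("ACK-0001");
    }
}
```

A small set of contract tests run occasionally against the real system keeps the fake honest.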
When you have the choice, don’t write custom infrastructure when there are viable, commonly used, off the shelf components. The senior management at the beginning of this project were very apprehensive about using any kind of new infrastructure, especially open source tools.
Since this was primarily an integration project, asynchronous messaging should have been a very good fit. We wrote a tiny shared library for managing asynchronous communication between applications using Rabbit MQ as the underlying transport, but with some hooks for an easy move later to Azure Service Bus. That tiny library had to keep evolving into something much larger as the use cases grew more complex, to the point where I felt like it was dominating our workload.
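The seam such a library needs might look something like the sketch below (hypothetical names). Application code targets one abstraction, and the broker, Rabbit MQ today or Azure Service Bus later, stays an implementation detail behind it:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// The one abstraction application code is allowed to see.
public interface IMessageTransport
{
    Task SendAsync(string queue, byte[] message);
    void Subscribe(string queue, Func<byte[], Task> handler);
}

// An in-memory implementation, handy for tests; RabbitMqTransport and
// AzureServiceBusTransport would implement the same interface.
public class InMemoryTransport : IMessageTransport
{
    private readonly ConcurrentDictionary<string, List<Func<byte[], Task>>> _handlers =
        new ConcurrentDictionary<string, List<Func<byte[], Task>>>();

    public void Subscribe(string queue, Func<byte[], Task> handler) =>
        _handlers.GetOrAdd(queue, _ => new List<Func<byte[], Task>>()).Add(handler);

    public async Task SendAsync(string queue, byte[] message)
    {
        if (_handlers.TryGetValue(queue, out var handlers))
            foreach (var handler in handlers) await handler(message);
    }
}
```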
I won’t even say “in retrospect,” because we knew this full well from day one, but the project would have gone better if we’d been able to use an off the shelf toolset like MassTransit, NServiceBus, or my own Jasper framework for the messaging support. I wish I’d pushed much harder at the time to build out a more robust messaging foundation, but back then I felt lucky just to get Rabbit MQ approved.
At the end we actually had a consensus agreement to rip out our custom messaging library and replace that with MassTransit, but the clock ran out on us. If and when the customer is able to do that themselves, I think they’ll have much more robust error handling and instrumentation that should result in more successful daily operations.
There were more egregious “NIH” (“not invented here”) violations than what I described above, but I’m only going to deal with issues where I had some level of control or influence.
The waterfall software process on this project was just as problematic as it ever was. We had to spend a lot of energy upfront on intermediate deliverables that didn’t add much value, but the killer, as usual with waterfall, was how damn slow the feedback cycles were, with no substantial integration testing happening until very late in the project. I’m aware that many of you reading this will have very negative opinions of and experiences with Agile (I blame Scrum though), but Agile done reasonably well means having rapid and early feedback cycles to find and fix problems quickly.
Shared databases are a common scourge of enterprise architectures. Dating all the way back to my 2005 (!) post Overthrowing the Tyranny of the Shared Database, sharing databases between applications has been a massive pet peeve of mine. Hell, tilting at the shared database windmill at my previous company contributed a little bit to me leaving. At the very least, make sure the %^$&$^&%ing shared database structure is completely described in source control somewhere and fully scripted out so any developer can quickly spin up an up to date copy of that database for testing as needed. If you depend on manual database changes made independently of the application development around the shared database, you should expect a great deal of friction and production problems related to it.
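At its simplest, that scripting discipline can be a throwaway tool like the sketch below: keep every schema change as an ordered .sql file in source control and replay them against a local database. The paths and connection string are placeholders, and a real project would more likely reach for a migration tool like DbUp or RoundhousE:

```csharp
using System;
using System.IO;
using System.Linq;
using Microsoft.Data.SqlClient;

// Rebuild a local copy of the shared database by replaying the ordered
// migration scripts kept in source control. Deliberately naive: it does
// not handle GO batch separators or track which scripts already ran.
public static class RebuildDatabase
{
    public static void Main()
    {
        using var conn = new SqlConnection(
            "Server=localhost;Database=SharedDb;Integrated Security=true");
        conn.Open();

        foreach (var file in Directory.EnumerateFiles("db/migrations", "*.sql")
                                      .OrderBy(f => f))
        {
            using var cmd = new SqlCommand(File.ReadAllText(file), conn);
            cmd.ExecuteNonQuery();
            Console.WriteLine($"Applied {Path.GetFileName(file)}");
        }
    }
}
```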
One more time with feeling for my longtime readers:
Sharing a database between applications is like drug users sharing needles
Things to research for later
The big takeaways for me from this project are to add some additional error handling and distributed tracing approaches to my integration project tool belt. As soon as I get a chance, I’m doing a deeper dive into the OpenTelemetry specification with a thought toward adding direct support to Jasper and maybe Marten as a learning experience. I’m also going to add some circuit breaker support directly into Jasper.
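As a starting point for that research, the System.Diagnostics.ActivitySource API is what the OpenTelemetry .NET SDK subscribes to, so instrumenting message handling could begin with something like this sketch (names are illustrative):

```csharp
using System.Diagnostics;

// ActivitySource is the built-in .NET hook that the OpenTelemetry SDK
// listens to; without a registered listener, StartActivity returns null.
public static class Messaging
{
    private static readonly ActivitySource Source = new ActivitySource("MyApp.Messaging");

    public static void HandleMessage(string messageType, byte[] body)
    {
        using var activity = Source.StartActivity($"handle {messageType}");
        activity?.SetTag("messaging.message_type", messageType);
        activity?.SetTag("messaging.payload_bytes", body.Length);

        // ...actual handler invocation goes here...
    }
}
```

Wiring the OpenTelemetry SDK to listen to that source then exports the spans to whatever tracing backend the operations team already uses.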
For any of you who are huge fans of Stephen King’s Dark Tower novels, you know that King modeled Roland on Clint Eastwood’s character from the spaghetti westerns, but living inside of a Lord of the Rings style epic tale. I think Idris Elba would have been awesome as Roland in the Dark Tower movie if they hadn’t changed the story and the character so much from the books. Grrr.