Configuration Drifts and Fallbacks: Lessons Learned

2026-02-17



Two weeks ago, we finally did a client-side deployment of the application we had been building for three months. Almost nothing went as expected. Even with peer review of every PR, maintained documentation, and manageable Dockerfiles, we still got hit by surprises. There were plenty of issues: misinterpretations of the client's requirements, a strategy that turned out not to be feasible, and a sudden change in the scale of the data we were working with. But the one I learned the most from was configuration drift.

I don't know if this is even the correct term. I have an academic background in Mechatronics, so I have some experience working with sensors. Sensor drift is "the gradual and unintended change in a sensor's output signal over time, despite the measured input remaining constant". We had to work with a lot of secrets and environment variables, and there was one source-of-truth example .env file that we all followed. During the project, though, each of us had to change one or two variables to our liking on multiple occasions, and some of those changes we forgot to document. The result was that the final output, the way the application behaved, remained visibly the same on all of our machines, yet each of us was running a different, almost indistinguishable configuration. So when it finally came to deployment and we started testing, the metrics were inconsistent all around. This was the silent killer that had us scratching our heads and rushing out hot-fixes for a week straight.
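Something as small as the sketch below would probably have caught this early. It is not what we actually ran, just a minimal drift check under a couple of assumptions: that the source-of-truth file is committed as `.env.example`, that each developer's local file is `.env`, and that the `python-dotenv` package is available (both file names are illustrative).

```python
# drift_check.py -- a minimal sketch, not our actual tooling.
# Assumes a committed .env.example as the source of truth and a local .env.
from dotenv import dotenv_values  # pip install python-dotenv

def report_drift(example_path: str = ".env.example", local_path: str = ".env") -> None:
    example = dotenv_values(example_path)
    local = dotenv_values(local_path)

    missing = sorted(set(example) - set(local))   # declared in the example, unset locally
    extra = sorted(set(local) - set(example))     # set locally but never documented
    changed = sorted(
        k for k in example.keys() & local.keys()
        if example[k] != local[k]                  # values that quietly diverged
    )

    for label, keys in (("missing", missing), ("undocumented", extra), ("changed", changed)):
        if keys:
            print(f"{label}: {', '.join(keys)}")

if __name__ == "__main__":
    report_drift()
```

In practice you might only compare keys, since an example file usually holds placeholder values; even that much, run in CI or as a pre-commit hook, would have surfaced the undocumented variables long before deployment.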

The other valuable lesson I have now internalized is to never blindly trust fallbacks. We had fallbacks in place so the application would keep running even when certain services were unavailable. The absence of any explicit warning meant we never actually checked what was going wrong in the first place, and our assumption about how a certain part of the system was working turned out to be badly misguided. I now have a much deeper appreciation for the philosophy of explicit error handling and verbose logging.
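Here is a minimal sketch of the pattern we should have followed. The service call and the cached fallback are hypothetical stand-ins for our real dependencies; the point is only that the degraded path announces itself instead of failing over in silence.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("recommendations")

def fetch_from_service(user_id: str) -> list[str]:
    # Stand-in for a call to a service that may be down.
    raise ConnectionError("recommendation service unreachable")

def load_cached_fallback(user_id: str) -> list[str]:
    # Stand-in for the degraded path we fall back to.
    return ["cached-item-1", "cached-item-2"]

def get_recommendations(user_id: str) -> list[str]:
    try:
        return fetch_from_service(user_id)
    except ConnectionError as exc:
        # The fallback still keeps the app usable, but it logs loudly
        # so nobody mistakes degraded behaviour for the real thing.
        logger.warning("falling back to cached results for %s: %s", user_id, exc)
        return load_cached_fallback(user_id)

if __name__ == "__main__":
    print(get_recommendations("demo-user"))
```

A warning like that, surfaced in our logs or dashboards, would have told us within hours rather than days that an entire part of the system was quietly running on its fallback.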

We did manage to deliver the deployment as intended despite the hiccups. But it didn't need to be this exhausting. We ended up working extra hours for a whole week, and that is a telltale sign that something has gone massively wrong. I will certainly take these lessons to heart and carry them into whatever I work on next.