Earlier this year, I got an opportunity to work on an existing cloud provisioning automation effort that wasn’t going as well as leadership had hoped. This was puzzling, as there is no shortage of scripting, APIs, and 3rd party tools available for automated cloud provisioning. Public cloud providers go out of their way (for obvious reasons) to help corporations ramp up to their cloud offerings in an automated way. The platform teams tasked with this effort were also smart and motivated, and the goals they were shooting for felt reasonable and well thought out.
Around nine months later, the teams are back on track and have delivered an impressive 10x increase in automation compared to the previous 24 months. There were no technology or tooling changes, though along the way the teams regrettably lost ~15% of their personnel.
Chesterton’s fence is the principle that reforms should not be made until the reasoning behind the existing state of affairs is understood.
In what ways did this metaphorical fence materialize for the automation teams?
- Assuming what users want
- Assuming existing tool set reuse
- Assuming that they could do it alone
Assuming what users want
The teams told me they were deeply concerned about the lag and under-resourcing in the tooling and design changes required for the new automation’s user experience. This was seen as one of the biggest potential blockers to adopting the new automation.
We set up recorded video sessions so the development teams could observe users’ day-to-day work: what they were doing and how they did it. This built rapport, sharpened understanding of the real problems, unearthed unknown bottlenecks, and made the project feel more real.
It also turned out that the apparent blockers around missing user experience tooling did not exist!
Assuming tool set reuse
The development teams designed their automation to orchestrate and reuse the existing tool sets their users relied on in their daily workflows.
Big mistake! The teams discounted just how much the existing tool sets were dumb, in the sense that they relied on human smarts to be used correctly and reliably. This hid a great deal of complex business logic, for example around capacity planning and self-healing, that now had to be built into the automation. As the teams realized this shortcoming, they pushed the existing tool set authors harder to enhance their offerings, which in turn made the new automation more robust and reliable.
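To make the idea concrete, here is a minimal sketch of the kind of implicit rule that surfaces when you wrap a “dumb” tool in automation. The names and numbers (ClusterState, provision_vm, the headroom buffer) are hypothetical illustrations, not the teams’ actual tool set: the point is simply that a check a human operator did by eye has to become explicit logic.

```python
# Hypothetical sketch: a provisioning wrapper that encodes the capacity check
# a human operator used to perform mentally before invoking the existing tool.
from dataclasses import dataclass


@dataclass
class ClusterState:
    total_cores: int
    used_cores: int

    @property
    def headroom(self) -> int:
        # Cores still unallocated on the cluster.
        return self.total_cores - self.used_cores


def provision_vm(cluster: ClusterState, requested_cores: int, buffer_cores: int = 8) -> bool:
    """Provision only if the request preserves an operational buffer.

    The underlying tool would happily oversubscribe the cluster; a human
    operator silently enforced this buffer. Automation has to state the
    rule explicitly, or it faithfully reproduces the tool's "dumbness".
    """
    if cluster.headroom - requested_cores < buffer_cores:
        # A human would defer the request or pick another cluster;
        # the automation now has to make that decision too.
        return False
    cluster.used_cores += requested_cores
    return True


if __name__ == "__main__":
    cluster = ClusterState(total_cores=128, used_cores=110)
    print(provision_vm(cluster, requested_cores=16))  # False: would eat the buffer
    print(provision_vm(cluster, requested_cores=8))   # True: buffer preserved
```

Multiply this by every judgment call an operator makes (placement, retries, cleanup after partial failures) and the scale of the “missing” business logic becomes clearer.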
Assuming that they could do it alone
An extension of the previous assumption: troubleshooting and migrating an existing human-driven process is surprisingly complex. No amount of documentation, ticket logs, or post-incident reviews can capture the richness and intricacy of the remediation work humans perform to solve complex problems.
The teams started partnering with the existing expert operations teams on troubleshooting issues. This led to an even better understanding of the edge cases in the existing process, and to effective designs for replacing them in the automation.