Ops in Devops

Setting up devops is a learning journey for most programmers.

Early in my engineering career, before cloud computing was invented, making my programs work in my company’s datacenter could be summed up as making the program:

configurable: no hardcoded paths or secrets in the code, and support for feature flags, both static (driven through property files) and dynamic (driven through databases).
deployable: installable on machines other than mine, typically by building and bundling scripts that automate program installation.
observable: consistent logging, plus hooks to enable tracing in distributed systems. This was invaluable in troubleshooting errors.
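
As a minimal sketch of the "configurable" point: the setting names below (APP_DATA_DIR, DB_PASSWORD, FEATURE_NEW_UI) are hypothetical, and reading them from environment variables is my assumption here rather than the property files I used back then. The idea is the same either way: paths, secrets, and static feature flags come from outside the code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    data_dir: str
    db_password: str
    new_ui_enabled: bool


def load_settings(env=None):
    """Build Settings from an environment-like mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # The secret has no default: fail loudly instead of falling back to a hardcoded value.
    if "DB_PASSWORD" not in env:
        raise KeyError("DB_PASSWORD must be set")
    # Static feature flag: a handful of truthy spellings switch it on.
    flag = env.get("FEATURE_NEW_UI", "false").strip().lower()
    return Settings(
        data_dir=env.get("APP_DATA_DIR", "./data"),
        db_password=env["DB_PASSWORD"],
        new_ui_enabled=flag in {"1", "true", "yes", "on"},
    )
```

Passing a plain dictionary instead of the real environment keeps the function easy to exercise on any machine, which also helps with the "deployable" point.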

Operations in cloud computing address more concerns than making programs configurable, deployable, and observable.

In cloud computing, programs play the role of value-added cloud services.

Cloud service programs run on infrastructure (electricity, cabling, buildings) and hardware (computers, routers) leased from a cloud provider (for example, AWS or GCP). Cloud services are made accessible over the public internet and are used by web, desktop, and mobile applications, or by other cloud services.

Operations teams manage customer and user expectations for the organization. Users expect applications to work with minimal or no interruptions and, even in case of errors, expect the cloud solution to continue working in a degraded fashion. Customers and users also expect careful stewardship of their information: that it remains secure and tamper resistant, and is never lost.

A cloud service’s current level of functionality is gauged through its service level indicators (SLIs). Which SLIs matter depends on the type of service being designed. Special-purpose programs capture point-in-time SLI values from cloud services, aggregate them, and alert engineering teams when they degrade.

For example, a banking cloud service could have different service level indicators for money transfer, deposit, and withdrawal capabilities. Degradation in these SLIs can alert both cloud service engineering teams and customers.
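
To make the capture, aggregate, and alert steps concrete, here is a rough sketch of one possible SLI: the fraction of recent requests that succeeded, tracked per capability. The class name SliWindow, the window size, and the alert threshold are all illustrative assumptions, not how any particular monitoring product works.

```python
from collections import deque


class SliWindow:
    """Rolling success-ratio SLI over the last `size` requests (hypothetical helper)."""

    def __init__(self, size, alert_below):
        self.samples = deque(maxlen=size)  # the oldest sample drops off automatically
        self.alert_below = alert_below

    def record(self, succeeded):
        """Capture one point-in-time observation."""
        self.samples.append(bool(succeeded))

    def value(self):
        """Aggregate the captured samples into a single ratio between 0 and 1."""
        if not self.samples:
            return 1.0  # assumption: no traffic yet counts as healthy
        return sum(self.samples) / len(self.samples)

    def degraded(self):
        """True when the SLI has fallen below the alerting threshold."""
        return self.value() < self.alert_below


# One window per banking capability, each free to use its own threshold.
slis = {name: SliWindow(size=100, alert_below=0.99)
        for name in ("transfer", "deposit", "withdrawal")}
```

A monitoring loop would call record() for each request outcome and notify the team whenever degraded() turns true for any capability.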

Additionally, shareholders of the cloud service organization expect operations teams to keep the cost of running cloud services optimal and manageable. For example, they expect operations teams to scale resources up (or out) as demand for cloud services increases, and to scale them down when demand ebbs.

Elasticity is a design and architecture pattern that special-purpose control plane cloud services use to scale cloud services up, out, or down. This key capability goes to the heart of the cloud’s existence, because cloud providers charge customers for resources used over time and provide automated APIs for managing those resources.

A typical control plane cloud service captures resource utilization heartbeats from the systems that host cloud services, and then compares them with cloud service usage SLIs (for example, how many deposits customers are requesting).

When resource contention is detected (for example, high CPU usage on machines running the banking cloud service while the deposit SLI shows rising usage), the control plane service should automatically provision new servers, configure them, install the cloud service (in our example, the banking cloud service), and start it for users. Correspondingly, when idle resources are detected and usage SLIs are low, the control plane cloud service shuts systems down to save costs. AWS Lambda is an example of this elasticity pattern packaged so that customers can use it out of the box. Serverless is the name of the associated industry trend.
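
The decision at the core of that control loop can be sketched as a small function. Everything here is an assumption made for illustration: the function name, the thresholds, the one-server-at-a-time step, and the use of requests per second as the usage SLI. A real control plane would also look at trends over time rather than a single reading.

```python
def desired_servers(current, cpu_utilization, requests_per_second, *,
                    cpu_high=0.80, cpu_low=0.20,
                    rps_high=100.0, rps_low=10.0,
                    min_servers=1, max_servers=10):
    """Return how many servers the fleet should have after this heartbeat."""
    if cpu_utilization >= cpu_high and requests_per_second >= rps_high:
        # Contention backed by real demand: scale out by one server.
        target = current + 1
    elif cpu_utilization <= cpu_low and requests_per_second <= rps_low:
        # Idle machines and low usage: shrink by one server to save cost.
        target = current - 1
    else:
        # Signals disagree or sit in the normal band: leave the fleet alone.
        target = current
    # Never drop below the floor or grow past the budget ceiling.
    return max(min_servers, min(max_servers, target))
```

Requiring both signals to agree is a deliberate choice in this sketch: high CPU without rising usage more likely points at a bug than at demand, and adding servers would only hide it.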

So far, we’ve been assuming that cloud systems run perfectly. The reality is far from that. Systems fail, networks become congested, programs crash, and new changes (typically configuration changes) disrupt functionality all the time. When cloud SLIs degrade, the engineering team is engaged to manage and mitigate the service “incident”.

Another big consideration is managing the security and privacy of customer and user information. I wrote a blog post on cloud compliance that goes into detail on the additional constraints placed on operations teams to maintain a certified process.

This article provides a taste of the work being done by a devops cloud service engineering team.