In earlier posts, we discussed how Splunk Cloud evolved from a stateful monolith into an elastic, cloud-native platform. But elasticity alone wasn’t enough. A lingering Achilles’ heel remained — configuration management.

Splunk’s on-prem model split cluster configuration across three major roles: ingestion, indexing, and search. Admins managed these settings manually, often SSHing into servers to copy or update config files. While workable in customer-controlled environments, this model completely broke down in the cloud.
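To make the old model concrete, a routine change often meant SSHing into an indexer and hand-editing a stanza like the one below. This is a representative indexes.conf snippet; the index name and values are illustrative, not taken from any real deployment.

```ini
# indexes.conf on an indexer, edited by hand over SSH (representative example)
[web_access_logs]
homePath   = $SPLUNK_DB/web_access_logs/db
coldPath   = $SPLUNK_DB/web_access_logs/colddb
thawedPath = $SPLUNK_DB/web_access_logs/thaweddb
maxTotalDataSizeMB = 500000
```

Multiply that by every index, every role, and every node in a cluster, and the operational weight becomes obvious.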

The Problem: Manual Configuration Undermines Elasticity

In the cloud, customers no longer had direct access to servers. Instead, the configuration process translated into:

  1. Operational bottlenecks: Customers filed support tickets for every config change, overloading internal ops teams.
  2. Elasticity blockers: Configurations tied to physical nodes meant containers couldn’t be freely scheduled, scaled, or replaced.

This issue surfaced repeatedly in Splunk Answers and was cited as a top pain point in our Net Promoter Score (NPS) feedback. It became clear: configuration had to evolve — from manual and server-bound to automated and declarative.

The Insight: Decouple Config, Just Like State

Having previously helped decouple state from the application to enable elasticity, my team wondered:

> What if we decoupled configuration as well?

The naive answer was to centralize it, for example by using an external data store like PostgreSQL to drive configuration decisions. But that oversimplified the problem. The real issue wasn't storage; it was fragmentation and the absence of coordinated orchestration across roles, components, and clouds.

The Solution: Admin Config Service (ACS)

From that realization, we designed and built ACS — the Admin Config Service — as a first-class cloud-native control plane for configuration management.

Just like our elasticity initiative, this effort brought together a wide coalition: infrastructure, platform, security, and product teams. ACS provided a consistent API surface to configure and control clusters across clouds and workloads.

Key architecture decisions included:

  • Private config listeners on every node: Each Splunk container ran an ACS listener, responsible for negotiating and syncing its local configuration with the control plane. This turned out to be simpler than expected, since we could reuse Splunk's existing C++ configuration APIs (a sketch of the listener loop follows this list).
  • Role-aware protocol design: Listeners understood cluster topology (leader, peer, forwarder, etc.) and adjusted configuration accordingly.
  • Cross-layer orchestration: ACS spoke to the platform controller and infrastructure APIs, coordinating across storage, compute, and security layers.
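
To give a feel for the listener pattern, here is a minimal sketch of a role-aware config sync loop. Everything in it is hypothetical: the control-plane URL, the `/config?role=` endpoint, and the field names are stand-ins, not the actual ACS protocol, and the real listener applies changes through Splunk's C++ configuration layer rather than printing them.

```go
// Hypothetical sketch of a per-node config listener: it knows its role,
// polls the control plane for the desired configuration, and applies any
// drift locally. Endpoints and types are illustrative only.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// DesiredConfig is the declarative config the control plane holds for a node.
type DesiredConfig struct {
	Version  int               `json:"version"`
	Settings map[string]string `json:"settings"`
}

type Listener struct {
	controlPlaneURL string // illustrative internal endpoint
	role            string // "indexer", "search-head", "forwarder", ...
	applied         int    // version of the config currently applied locally
}

// sync fetches the desired config for this node's role and applies it if newer.
func (l *Listener) sync() error {
	url := fmt.Sprintf("%s/config?role=%s", l.controlPlaneURL, l.role)
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var desired DesiredConfig
	if err := json.NewDecoder(resp.Body).Decode(&desired); err != nil {
		return err
	}
	if desired.Version <= l.applied {
		return nil // already up to date; nothing to do
	}
	// In the real system this step would rewrite local .conf files and
	// trigger a reload through the existing configuration APIs.
	for k, v := range desired.Settings {
		fmt.Printf("applying %s = %s\n", k, v)
	}
	l.applied = desired.Version
	return nil
}

func main() {
	l := &Listener{controlPlaneURL: "http://acs.internal", role: "indexer"}
	for {
		if err := l.sync(); err != nil {
			fmt.Println("sync failed, will retry:", err)
		}
		time.Sleep(30 * time.Second)
	}
}
```

The important design point is that the node pulls and reconciles its own configuration, so the control plane never needs to know which physical host a container happens to land on.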

The Result: Instant Configuration, Infrastructure as Code

With ACS in place, config changes that used to take weeks via ticket workflows could now be resolved in minutes. Even better:

  • Self-service REST API: Customers could manage their Splunk Cloud clusters declaratively, integrating configuration into CI/CD and Infrastructure-as-Code (IaC) pipelines (see the example after this list).
  • Cloud-agnostic rollout: ACS was deployed across both AWS and Google Cloud, providing a unified experience regardless of underlying infrastructure.
  • Scalable config propagation: Elasticity was fully realized — containers could be spun up, configured, and torn down dynamically without human intervention.
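
As a taste of the self-service workflow, the sketch below creates an index on a Splunk Cloud stack through the ACS REST API, the kind of call a CI/CD job might make. The endpoint path and request fields follow the published ACS API as I recall it, but treat them as illustrative and confirm against the current API reference; the stack name and token are placeholders pulled from the environment.

```go
// Hedged example: creating an index via the ACS REST API from a CI/CD job.
// Verify the exact endpoint and fields against Splunk's ACS documentation.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	stack := os.Getenv("SPLUNK_STACK") // e.g. "mystack" (placeholder)
	token := os.Getenv("SPLUNK_TOKEN") // an auth token with ACS permissions

	body, _ := json.Marshal(map[string]any{
		"name":           "web_access_logs",
		"searchableDays": 90,
	})

	url := fmt.Sprintf("https://admin.splunk.com/%s/adminconfig/v2/indexes", stack)
	req, _ := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	fmt.Println("ACS responded with", resp.Status)
}
```

Compare this with the hand-edited indexes.conf stanza earlier in the post: the same intent, but expressed as an API call that any pipeline can run, audit, and repeat.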

Reflections

ACS became a cornerstone in Splunk Cloud’s evolution — not just technically, but in how it reshaped customer experience. It tackled one of the most entrenched, manual parts of the stack and made it programmable, reliable, and fast.

This work reaffirmed a key lesson: meaningful cloud transformation isn’t just about compute and storage. It’s about control planes. And when configuration becomes an API, not a file, true cloud elasticity becomes possible.