At Splunk, our transition from a customer-hosted, on-premises model to a cloud-native architecture brought fundamental challenges — both technical and cultural. One of the earliest and most pressing concerns was infrastructure cost. As a company historically dependent on customers provisioning their own compute, we faced a steep learning curve in operating and scaling workloads efficiently in the cloud.
Meanwhile, competitors were already cloud-native — built from the ground up for elasticity and low operational overhead. We had a choice: rewrite the product for the cloud, or adapt what we had. We chose the latter, a bold decision that required architectural creativity, but allowed us to retain battle-tested functionality while pivoting quickly.
Proving Viability Under Tight Cost Constraints
Our COO posed a clear challenge: can you run a fully functional Splunk cloud instance with moderate traffic for under $50/month?
This became our North Star — a forcing function that catalyzed deep architectural simplification, focused experimentation, and efficient cloud-native integration, all without rewriting the core product.
Step 1: Decoupling State From the Monolith
We began by isolating and externalizing state management. In the on-premise version, ingestion, indexing, and search logic were deeply intertwined in a stateful monolith. To achieve elasticity, we needed a separation of concerns:
- Ingestion: Shifted to external message queues like Amazon SQS and Kinesis, decoupling data entry points from Splunk core and enabling scalable, distributed ingest pipelines.
- State Isolation: Moved indexing and search metadata out of local state and into an Amazon Aurora PostgreSQL database, enabling persistence, lookups, and coordination across stateless containers.
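To make the decoupled ingest path concrete, here is a minimal sketch in Python with boto3: a stateless worker that drains an SQS queue and forwards events to a Splunk HTTP Event Collector (HEC) endpoint. The queue URL, HEC endpoint, and token are placeholder assumptions, not production values.

```python
# Minimal sketch: poll an SQS queue and forward events to a Splunk HTTP
# Event Collector (HEC) endpoint. Queue URL, HEC URL, and token below are
# hypothetical placeholders, not production values.
import json

import boto3
import requests

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ingest-queue"  # hypothetical
HEC_URL = "https://splunk.example.com:8088/services/collector/event"         # hypothetical
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                           # hypothetical

sqs = boto3.client("sqs")

def pump_once() -> int:
    """Pull one batch of messages from SQS and forward each to Splunk HEC."""
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=10
    )
    messages = resp.get("Messages", [])
    for msg in messages:
        # Assumes the queue carries JSON event payloads.
        event = {"event": json.loads(msg["Body"]), "sourcetype": "_json"}
        r = requests.post(
            HEC_URL,
            headers={"Authorization": f"Splunk {HEC_TOKEN}"},
            json=event,
            timeout=5,
        )
        r.raise_for_status()
        # Delete only after Splunk has accepted the event, so failures retry.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    return len(messages)
```

Because the worker holds no state of its own, any number of copies can run behind the queue and be replaced freely.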
Step 2: Embracing Complexity Where It Comes With the Territory
Splunk’s on-prem architecture stores time-series data as hot and warm buckets on local disk, optimized for fast search access. In our cloud design, we designated Amazon S3 as the system of record for all indexed data — gaining scalability, durability, and cost efficiency.
However, simply moving all data to object storage would have introduced significant latency for search. Most real-time queries hit hot buckets, and those needed to remain fast.
To address this, we made a deliberate architectural choice: we mounted persistent EBS volumes into Splunk containers to host hot buckets locally. This enabled:
- Fast access to recent data without relying on S3 fetches
- Stateful data handling within otherwise elastic, replaceable containers
- Efficient container reuse across sessions with minimized I/O overhead
While this introduced some operational complexity (managing volume lifecycle, attachment, and reuse), it preserved the high-performance search experience users expected, without compromising our broader cloud-native goals.
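As a rough illustration of the volume lifecycle involved, the following sketch finds a reusable hot-bucket volume by tag and attaches it to the instance hosting a Splunk container. The tag scheme, device name, and single-AZ assumption are illustrative simplifications, not our actual implementation.

```python
# Minimal sketch: find a reusable hot-bucket EBS volume by tag and attach it
# to the instance hosting a Splunk container. The tag scheme and device name
# are illustrative; AZ placement and error handling are simplified.
import boto3

ec2 = boto3.client("ec2")

def attach_hot_bucket_volume(instance_id: str, tenant: str,
                             device: str = "/dev/xvdf") -> str:
    """Attach an available hot-bucket volume for `tenant`; return its ID."""
    resp = ec2.describe_volumes(
        Filters=[
            {"Name": "tag:role", "Values": ["splunk-hot-buckets"]},
            {"Name": "tag:tenant", "Values": [tenant]},
            {"Name": "status", "Values": ["available"]},
        ]
    )
    if not resp["Volumes"]:
        raise RuntimeError(f"no reusable hot-bucket volume for tenant {tenant}")
    volume_id = resp["Volumes"][0]["VolumeId"]
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    # Wait until the attachment is live before the container mounts the device.
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])
    return volume_id
```

Reattaching the same tagged volume to a replacement container is what lets hot buckets outlive any individual container.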
Step 3: Designing a Smart, Distributed Load Balancer
With data now in object storage and ingestion decoupled, the next hurdle was real-time job orchestration across containers.
We developed a custom load balancer that became the control plane for all container coordination:
- Heartbeat Coordination: All Splunk containers periodically reported their health and capacity status.
- Consistent Hashing: Incoming jobs (e.g., bucket downloads, indexing tasks, or search queries) were routed to nodes via consistent hashing to ensure balance and data locality.
- Workload-Aware Scheduling: The balancer considered which nodes already had relevant S3 data cached locally to reduce unnecessary I/O and improve latency.
This drastically improved search efficiency by co-locating compute with hot data — a fundamental principle of cloud-native design.
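Our balancer's internals are out of scope here, but its routing primitive can be illustrated with a textbook consistent-hash ring. In this Python sketch (node names, the replica count, and the job-key format are illustrative, not our actual implementation), the same job key always lands on the same healthy node, which is what keeps locally cached S3 data warm:

```python
# Minimal sketch of a consistent-hash ring with virtual nodes, the routing
# primitive described above. Node names, the replica count, and the key
# format are illustrative, not our actual implementation.
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes: int = 64):
        # Each physical node appears `vnodes` times on the ring for balance.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def route(self, job_key: str) -> str:
        """Map a job key to the first node clockwise on the ring."""
        idx = bisect.bisect(self._keys, self._hash(job_key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["container-a", "container-b", "container-c"])
print(ring.route("tenant42:index=web:search"))  # same key -> same node -> warm cache
```

Virtual nodes smooth out the key distribution, so adding or removing a container remaps only a small fraction of jobs rather than reshuffling the whole cluster.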
Step 4: Elastic Autoscaling
To complete the architecture, we added dynamic scaling controls:
- Auto-Scale Up: When all containers were at capacity, new indexing/search containers were spun up on demand.
- Auto-Scale Down: As user activity subsided (e.g., during off-peak hours), idle containers were gracefully shut down to minimize cost.
This behavior mirrored the operational patterns of cloud-native SaaS: pay for what you use, scale only when needed.
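A simplified version of that decision loop, assuming a hypothetical control-plane client fed by the Step 3 heartbeats, might look like the sketch below; the thresholds and class names are illustrative assumptions, not our production values.

```python
# Minimal sketch of the scale-up/scale-down loop described above. The
# Cluster/Container classes are illustrative stand-ins for the real control
# plane; thresholds are assumptions, not production values.
from dataclasses import dataclass, field

SCALE_UP_UTIL = 0.85    # add capacity when average utilization exceeds this
SCALE_DOWN_UTIL = 0.30  # remove capacity when it stays below this
MIN_CONTAINERS = 1      # never scale to zero

@dataclass
class Container:
    name: str
    utilization: float = 0.0  # reported via the Step 3 heartbeats

@dataclass
class Cluster:
    containers: list = field(default_factory=list)

    def launch_container(self) -> None:
        self.containers.append(Container(f"c{len(self.containers)}"))

    def drain_and_stop(self, container: Container) -> None:
        self.containers.remove(container)

def autoscale_tick(cluster: Cluster) -> None:
    """One reconciliation pass: compare heartbeat-reported load to thresholds."""
    if not cluster.containers:
        cluster.launch_container()
        return
    avg = sum(c.utilization for c in cluster.containers) / len(cluster.containers)
    if avg > SCALE_UP_UTIL:
        cluster.launch_container()                        # scale up on demand
    elif avg < SCALE_DOWN_UTIL and len(cluster.containers) > MIN_CONTAINERS:
        idle = min(cluster.containers, key=lambda c: c.utilization)
        cluster.drain_and_stop(idle)                      # graceful scale-down
```

In practice a loop like this runs on a timer, and scale-down drains in-flight searches before stopping a container.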
The Result: A Cloud-Efficient Splunk
We succeeded in building an elastic, auto-scaling version of Splunk capable of handling diverse workloads — ingestion, indexing, and search — with:
- Costs under the $50/month target for a small instance
- Real-time queryability of freshly ingested data
- Fully decoupled, cloud-native infrastructure
- Significant reduction in operational overhead
This experiment proved that re-architecting for the cloud doesn't always require a full rewrite. With a clear North Star, smart orchestration, and modular refactoring, we built a scalable, efficient, and responsive cloud architecture on top of a legacy platform — unlocking elasticity without sacrificing capability.