Engineering | 4 min read

On-Prem Observability Never Really Left

Over the past decade, monitoring followed infrastructure into the cloud. Dashboards centralized, metrics moved off host, and observability became something consumed remotely. For many teams this was a real improvement. Visibility became easier to share, simpler to operate, and faster to deploy.

And yet, on-prem observability never disappeared. Not as a legacy artifact or technical debt, but as a deliberate and persistent design choice in environments that care deeply about performance, latency, and failure analysis.

This isn't about resisting the cloud. It's about where truth in a system is first observed, and what's lost when that truth is abstracted too early.

The Primary Signal Is Still Local

The most meaningful operational signals originate inside the operating system and the device itself. Storage latency, I/O queue depth, scheduler delays, interrupt handling, and page cache behavior all occur at the kernel level and often unfold over microseconds to milliseconds.

Modern tracing technologies such as eBPF exist precisely because observing these events after the fact is insufficient. Once telemetry is batched, aggregated, or exported, critical detail is already gone. Local collection is not an optimization; the host is the only place where full fidelity exists.

Observability Shares the Same Failure Modes as the System

Monitoring systems are part of the system they observe. When storage stalls, CPUs saturate, or networks congest, telemetry pipelines are affected by the same conditions. Agents fall behind, buffers fill, and export paths drop data. Remote ingestion becomes less reliable exactly when insight is needed most. This is not a tooling failure but a systems property, what researchers call "differential observability," where failure detectors may not notice problems even when applications are afflicted.[1]

Google's Site Reliability Engineering guidance repeatedly emphasizes designing monitoring with partial failure in mind. Any approach that assumes uninterrupted off-host visibility during an incident is making an optimistic assumption about reality. Local observability remains available when remote paths degrade. That property alone explains why it persists in critical environments.
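One way to picture "local first" is a recorder that always writes to a bounded on-host buffer and treats export as best effort. The sketch below is illustrative only: the class name, the `export` callable, and the capacity are assumptions, not any particular agent's API.

```python
from collections import deque


class LocalFirstRecorder:
    """Hypothetical agent core: record locally first, export on a best-effort basis."""

    def __init__(self, export, capacity=10_000):
        self._export = export                  # callable; may raise OSError on network trouble
        self.local = deque(maxlen=capacity)    # bounded ring buffer, oldest samples dropped first
        self.pending = deque(maxlen=capacity)  # samples not yet exported
        self.export_failures = 0

    def record(self, sample):
        # The local write never depends on the remote path.
        self.local.append(sample)
        self.pending.append(sample)

    def flush(self):
        """Try to export pending samples; stop at the first failure and keep the rest."""
        while self.pending:
            try:
                self._export(self.pending[0])
            except OSError:
                self.export_failures += 1
                return False
            self.pending.popleft()
        return True


def broken_export(sample):
    # Stand-in for a congested or unreachable ingestion endpoint.
    raise ConnectionError("remote ingestion unavailable")


recorder = LocalFirstRecorder(broken_export, capacity=4)
for latency_us in (120, 95, 48_000):
    recorder.record(latency_us)

exported = recorder.flush()
print(exported, list(recorder.local), recorder.export_failures)
```

Even with the export path failing, the three samples, including the 48 ms stall, are still readable on the host. A real agent would also need persistence and backpressure, which this sketch leaves out.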

Tail Behavior Dominates, and Averages Conceal It

One of the most important insights in distributed systems comes from the work on tail latency. When a request fans out to 100 servers in parallel, even if each has only a 1% chance of a slow response, about 63% of all user requests will experience at least one slow response.[2] A small fraction of operations determines whether a system feels fast or broken.
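The 63% figure follows from basic probability, assuming the servers are slow independently of one another: the chance that none of n calls is slow is (1 - p)^n, so the chance that at least one is slow is 1 - (1 - p)^n. A minimal sketch of that arithmetic (the function name is just for illustration):

```python
def p_any_slow(fanout: int, p_slow: float) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** fanout


for n in (1, 10, 100):
    print(f"fan-out {n:>3}: {p_any_slow(n, 0.01):.1%}")
# fan-out   1: 1.0%
# fan-out  10: 9.6%
# fan-out 100: 63.4%
```

The per-server tail stays at 1% throughout; only the fan-out changes, which is why a rare event on one host becomes the common case for the user.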

Mean values, coarse sampling, and long export intervals don't reveal this. To understand tail behavior, systems must preserve distributions and capture short-interval behavior close to the source. This is especially true for storage and kernel-level events, where brief stalls can cascade into visible outages. This isn't just a theoretical concern. It's a documented and repeatedly observed phenomenon in large-scale systems.

Sampling Is Not the Problem - Where It Happens Is

Sampling often gets treated as inherently destructive, but that framing is inaccurate. Sampling becomes destructive when it occurs after causality has already been lost. Export-time sampling, rate limiting, and mean-based aggregation erase exactly the information engineers need to diagnose incidents. Google's SRE guidance notes that a web service with 100 ms average latency may have 1% of requests taking 5 seconds,[3] a 50x tail that mean-based aggregation completely obscures.
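To make the 100 ms example concrete, the sketch below builds a hypothetical workload that matches it: 1,000 requests, 1% of them taking 5 seconds, with the remaining latencies chosen so the mean lands on 100 ms. The numbers are constructed for illustration, not measured.

```python
import math
import statistics

# Hypothetical workload: 10 of 1,000 requests take 5 s; the other 990 share
# one latency picked so that the overall mean is 100 ms.
slow = [5000.0] * 10
fast_ms = (100.0 * 1000 - sum(slow)) / 990   # about 50.5 ms
latencies = [fast_ms] * 990 + slow


def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    rank = min(len(ordered), max(1, math.ceil(q / 100 * len(ordered))))
    return ordered[rank - 1]


mean_ms = statistics.fmean(latencies)
p50_ms = percentile(latencies, 50)
p999_ms = percentile(latencies, 99.9)
share_over_1s = sum(v > 1000 for v in latencies) / len(latencies)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.1f} ms  "
      f"p99.9={p999_ms:.1f} ms  over 1 s: {share_over_1s:.1%}")
```

The mean sits at 100 ms and the median near 50 ms, and neither hints that one request in a hundred takes 5 seconds. Only the high percentile or the count above a threshold exposes it.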

Deterministic, short-interval sampling performed at the source is fundamentally different. When combined with distribution-preserving techniques such as histograms or buckets, it captures the shape of system behavior, including spikes and rare events, while keeping data volumes manageable.
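As a rough sketch of what distribution-preserving collection at the source can look like, the example below records latencies into fixed power-of-two buckets. The bucket layout, class name, and sample values are assumptions for illustration; production systems use a variety of bucket schemes.

```python
import bisect

# Hypothetical bucket upper bounds in microseconds: 1, 2, 4, ... 1_048_576 (about 1 s).
BOUNDS_US = [2 ** i for i in range(21)]


class LatencyHistogram:
    """Fixed-size histogram: one counter per bucket plus one overflow slot."""

    def __init__(self):
        self.counts = [0] * (len(BOUNDS_US) + 1)

    def record(self, latency_us):
        # bisect_left finds the first bound >= latency_us, i.e. the bucket
        # whose range (previous bound, this bound] contains the sample.
        self.counts[bisect.bisect_left(BOUNDS_US, latency_us)] += 1

    def quantile_upper_bound(self, q):
        """Upper bound of the bucket that contains the q-quantile (0 < q <= 1)."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("empty histogram")
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= q * total:
                return BOUNDS_US[i] if i < len(BOUNDS_US) else float("inf")


# One collection interval: 9,990 normal I/Os at 80 µs and a brief stall of 10 I/Os at 40 ms.
samples = [80] * 9_990 + [40_000] * 10
hist = LatencyHistogram()
for s in samples:
    hist.record(s)

mean_us = sum(samples) / len(samples)
print(f"mean: {mean_us:.1f} µs")                            # roughly 120 µs
print("p50 bucket  <=", hist.quantile_upper_bound(0.5))     # 128
print("p99.95 bucket <=", hist.quantile_upper_bound(0.9995))  # 65536
```

The whole interval is summarized in 22 counters, yet the stall remains visible as a populated bucket far from the median, while the mean moves only from 80 µs to about 120 µs.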

The distinction isn't whether sampling exists. It's whether sampling preserves structure before abstraction. This is how teams reconcile fidelity with scale without pretending that raw export of everything is practical or necessary.

Centralization Is Valuable, but It Comes Second

Centralized systems excel at aggregation. They answer questions about scope, blast radius, and trends across fleets. They make it possible to correlate signals and reason at higher levels.

What they cannot do is reconstruct causality that was never captured. If you've ever tried to debug a latency spike using only your centralized dashboard, only to find the critical milliseconds were already aggregated away, you've experienced this firsthand.

Effective observability architectures reflect this ordering. High-fidelity signals are captured locally first so context is preserved. Aggregation follows and visualization comes last. This pattern appears repeatedly in environments that operate large storage fleets, latency-sensitive systems, and infrastructure where root cause analysis matters more than surface-level health indicators.
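The ordering matters because bucketed summaries built on the host can still be combined centrally without losing the tail, as long as every host uses the same bucket bounds. A minimal sketch under that assumption (the bounds, host data, and function name are hypothetical):

```python
import bisect
from collections import Counter

# Hypothetical bucket upper bounds in milliseconds, shared by every host.
BOUNDS_MS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]


def summarize(latencies_ms):
    """Host-side step: reduce raw samples to bucket counts keyed by upper bound."""
    hist = Counter()
    for value in latencies_ms:
        i = bisect.bisect_left(BOUNDS_MS, value)
        hist[BOUNDS_MS[i] if i < len(BOUNDS_MS) else float("inf")] += 1
    return hist


host_a = summarize([3] * 1000)
host_b = summarize([3] * 995 + [400] * 5)   # this host saw a brief stall

# Central step: merging is just adding counts bucket by bucket.
fleet = host_a + host_b
stalled = sum(count for bound, count in fleet.items() if bound >= 500)

print(dict(fleet))                         # {5: 1995, 500: 5}
print("requests in the slow buckets:", stalled)   # 5
```

If each host had exported only its mean, the fleet view would show roughly 3 ms and 5 ms, and the five stalled requests could not be recovered afterward.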

Conclusion

This isn't an argument against cloud-based tooling or centralized monitoring. Those tools solve real problems and will continue to play an important role. It is an argument for respecting where truth enters the system.

Teams that care about performance, reliability, and debuggability continue to design observability from the host outward, not the dashboard inward. They capture locally, preserve structure, and centralize deliberately.

That's why on-prem observability never really left, and why it likely never will.

References

[1] Huang et al., "Gray Failure: The Achilles' Heel of Cloud-Scale Systems," HotOS, 2017.
[2] Dean & Barroso, "The Tail at Scale," Communications of the ACM, 2013.
[3] Google Site Reliability Engineering, "Monitoring Distributed Systems," 2016.