How Meta Saved 15,000 Servers with a Tiny Code Change


Imagine saving 15,000 servers’ worth of capacity a year with a single-character code change. That is the kind of win Meta found with Strobelight, a profiling orchestrator powered by eBPF. Because eBPF collects observability data with minimal overhead, Strobelight can profile live services cheaply, and the optimizations it surfaced cut CPU cycles by up to 20% for some of Meta’s top services, reducing infrastructure demand without compromising performance.

The Challenge: Profiling at Scale Without Overhead

Meta operates at a scale where even minor inefficiencies compound into massive infrastructure costs. Collecting detailed profiling data across its vast backend services posed a challenge: how to gather meaningful observability insights without introducing performance overhead or bloating storage requirements.

Traditional profiling methods often require modifying binaries or injecting additional instrumentation, both of which can slow down critical workloads. Meta needed a solution that could:

  • Provide profiling data with minimal impact on live services.
  • Normalize data across different environments and programming languages.
  • Work seamlessly across multiple Linux kernel versions.

Strobelight: eBPF-Powered Observability

To tackle this, Meta built Strobelight, a profiling orchestrator that combines multiple profilers, several of them built on eBPF, to extract performance insights efficiently. eBPF lets engineers attach probes to running processes without modifying application code, capturing:

  • CPU time spent in function calls and execution paths.
  • Call stacks for native code and for interpreted or VM-based languages such as Python, Java, and Erlang.
  • Off-CPU time and service request latency.
  • AI/GPU profiling and advanced memory tracking.

Unlike traditional methods, Strobelight’s eBPF-driven approach profiles services in real time with minimal interference, so it can run across Meta’s infrastructure without degrading application performance.
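
As a rough illustration of how sampled call stacks turn into CPU-time attribution, the sketch below folds each stack into a single key and counts samples per key, which is the form flame-graph tooling typically consumes. It assumes a sampling profiler and is not Strobelight’s actual implementation; the `Stack` alias and the `fold` and `aggregate` functions are hypothetical names chosen for this example.

```cpp
#include <map>
#include <string>
#include <vector>

// One sampled call stack, outermost frame first (hypothetical representation).
using Stack = std::vector<std::string>;

// Join frames with ';' to build the "folded" key for a stack.
std::string fold(const Stack& stack) {
    std::string key;
    for (const auto& frame : stack) {
        if (!key.empty()) key += ';';
        key += frame;
    }
    return key;
}

// Count how many samples landed on each distinct stack. With a fixed
// sampling interval, a stack's share of samples approximates its share
// of CPU time.
std::map<std::string, int> aggregate(const std::vector<Stack>& samples) {
    std::map<std::string, int> counts;
    for (const auto& sample : samples) {
        ++counts[fold(sample)];
    }
    return counts;
}
```

Calling `aggregate` on a batch of samples yields one counter per unique call path, so hot paths show up as the keys with the largest counts.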

Efficiency Gains: Fewer Servers, Faster Debugging

Deploying Strobelight at Meta led to tangible improvements:

  • 15,000 servers’ worth of annual capacity savings, traced to a single-character code change that profiling data helped surface.
  • Up to 20% fewer CPU cycles, reducing the number of required servers for Meta’s top services.
  • Accelerated debugging, helping engineers catch performance regressions before they hit production.
  • Optimized sampling, ensuring profiling data remains valuable without overwhelming storage systems.
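
This article does not spell out the single-character change, but Meta’s public write-up on Strobelight describes it as an `&` added after `auto` in C++, so that a `std::vector` on a hot path is referenced rather than copied. The sketch below shows that class of fix under those assumptions; `Catalog` and both functions are hypothetical stand-ins, not Meta’s code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a service object that owns a large vector.
struct Catalog {
    std::vector<std::string> items;
    const std::vector<std::string>& get_items() const { return items; }
};

// Before: plain `auto` deduces a value type, so the whole vector
// (and every string in it) is copied on each call.
std::size_t total_length_copy(const Catalog& catalog) {
    auto items = catalog.get_items();   // unintended copy
    std::size_t total = 0;
    for (const auto& s : items) total += s.size();
    return total;
}

// After: one added character. `auto&` binds a reference to the
// existing vector, so nothing is copied.
std::size_t total_length_ref(const Catalog& catalog) {
    auto& items = catalog.get_items();  // no copy
    std::size_t total = 0;
    for (const auto& s : items) total += s.size();
    return total;
}
```

Both functions return the same result; the difference is only in the allocation and copying work done per call, which is why a mistake like this tends to be visible in a profile long before it is visible in a code review.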

One of Strobelight’s key strengths is its ability to adapt to different kernel versions, applying feature fallbacks where needed. This allows Meta to maintain consistent profiling capabilities across its diverse infrastructure without kernel-specific workarounds.
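
A minimal sketch of what a feature fallback can look like, assuming the choice is made from the kernel release string. The BPF ring buffer map type was added in Linux 5.8, so older kernels fall back to the per-CPU perf buffer. Checking version numbers is a simplification used here for brevity (probing for the feature directly is generally more robust), and `KernelVersion`, `parse_release`, `Transport`, and `choose_transport` are hypothetical names, not part of Strobelight.

```cpp
#include <cstdio>
#include <string>

struct KernelVersion {
    int major = 0;
    int minor = 0;
};

// Parse the leading "major.minor" of a release string such as
// "5.15.0-91-generic". Returns {0, 0} if the string does not start
// with two dot-separated integers.
KernelVersion parse_release(const std::string& release) {
    KernelVersion v;
    if (std::sscanf(release.c_str(), "%d.%d", &v.major, &v.minor) != 2) {
        return KernelVersion{};
    }
    return v;
}

enum class Transport { RingBuffer, PerfBuffer };

// Prefer the BPF ring buffer (Linux 5.8+); otherwise fall back to the
// older perf buffer so the same profiler still works.
Transport choose_transport(const KernelVersion& v) {
    const bool has_ringbuf = v.major > 5 || (v.major == 5 && v.minor >= 8);
    return has_ringbuf ? Transport::RingBuffer : Transport::PerfBuffer;
}
```

On a live host the release string would typically come from `uname(2)`; an unparseable string maps to `{0, 0}` and therefore to the conservative fallback.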

Why eBPF?

Meta chose eBPF for its low overhead, its flexibility, and the fact that it requires no changes to the running application. Unlike legacy profiling tools that depend on application-level instrumentation, eBPF enables lightweight, high-resolution profiling without touching the codebase. This makes it well suited to observability at scale, spanning multiple languages and system configurations.

What’s Next?

Meta continues to refine Strobelight, expanding its capabilities into:

  • AI/ML workload observability, improving efficiency in deep learning models.
  • Advanced memory tracking, identifying and mitigating memory inefficiencies.
  • More complex efficiency analyses, optimizing resource allocation across the infrastructure.
  • Open sourcing Strobelight’s profilers and libraries, making these powerful tools available to the wider engineering community.

For more details, see the full case study on Meta’s Strobelight and eBPF efficiency gains.