Companies aim to leverage mature tools and extract actionable insights from them to maximize value and reduce costs, particularly within observability and incident response. This is a feedback loop that enables increased quality and velocity.
Data governance is a complex topic, whether you’re working with S3, a data warehouse, a data store, a database, big data or something else. In all cases, load balancing, horizontal scaling and distributed consensus are crucial to success.
Developer portals are becoming essential as the field evolves rapidly. Commercial solutions are racing to keep up with Backstage’s leadership; however, they are still not sufficiently extensible or customizable to meet all needs.
eBPF has reached maturity, and its application is expanding into use cases I would not have anticipated.
Some engineers seem to have an unusual fascination with Slack, which I don’t share due to its poor UX.
Dude, You Forgot the Feedback: How Your Open Loop Control Planes Are Causing Outages
Laura de Vesine - Datadog
You Depend on Time, This Is How It Works and You Won’t Believe It
Philip Rowlands - Jane Street
Workshop: Loadshedding and Isolation Using Envoy Proxy
Laura Nolan & Niall Murphy - Stanza
Achieving Excellence: SLO Thresholds That Transform Service Quality
Thiara Ortiz - Netflix
Selective Reliability Engineering: There Is No Single Source of Truth
Elise Burke - Datadog
Why You’re (Probably) Doing Service Catalogs Wrong
Lisa Karlin Curtis - incident.io
Exploring the Unintended Consequences of Automation in Software
Courtney Nash - The VOID
Rock around the Clock (Synchronization): Improve Performance with High Precision Time!
Lerna Ekmekcioglu - Clockwork Systems
Mnemonic Rules for Eponymous Laws or: There’s a Law for That!
Peter Burkholder - U.S. Government
Lessons from Unix History
Diomidis Spinellis - AUEB & TU Delft
Treat Your Code as a Crime Scene
Adam Tornhill - CodeScene
From PIDs to Pods: The Life Cycle of an eBPF-Autoinstrumented Application
Marc Tudurí - Grafana Labs
Scheduling at Scale: eBPF Schedulers with Sched_ext
Daniel Hodges - Meta
Noisy Neighbors, through Networking
René Treffer and Ben Kochie - Reddit
Taming Noisy Benchmark Results Using Change Point Detection
Matt Fleming - Cloudflare
How a Single API Endpoint Saved Us 3000 CPU
Lasse Hels - Maersk
Managing the Risk of Software Supply Chain Attacks
Mark Hahn - Qualys
Synthetic Monitoring and E2E Testing: 2 Sides of the Same Coin
Carly Richmond - Elastic
Re-Building Envoy in Rust
Dawid Nowak - Huawei Ireland Research Lab
What About the Engineer's MTTR?
Ian Duffy - Cloudsmith
How to SRE Anything to Work Smarter and Live Better
Jennifer Petoff - Google
How SRE Can Help With Cost & Efficiency
John Looney - Crusoe Energy
SRE for LLMs: What We Learned While Launching
John Lunney - Google
Breaking Out of Our Hybrid Cloud Datastore EOL Chains
Konstantinos Fardelas - Skroutz SA
The Voyager Spacecraft—These Are the Only Engineers on Earth Who Want To Maximize Latency
Robert Barron - IBM
Rollout Monitoring at Scale: Reflections on Adopting Canarying in GCE
Roberto Frenna - Google
9 SLIs; OH MY!
Sal Furino - Bloomberg CRE
Opening the Box: Diagnosing Operating-System Task-Scheduler Behavior on Highly Multicore Machines
Julia Lawall - Inria-Paris
Granular CPU Capacity Management at Scale with eBPF
George Brighton and Cameron Howes - Goldman Sachs
Riot Games: Evolution of Observability at the Gaming Company
Erick Moreira and Kirill Mikhailov - Riot Games
A Powerful Logs Management Solution We All Have and Use but We Underestimate: systemd-journal
Costa Tsaousis - Netdata
Blast Radius Reduction for Large-Scale Distributed Systems
Linhua Tang - Huawei Ireland Research Centre
Get Your Non-SREs Oncall Ready!
JC van Winkel and Brad Lipinski - Google
Transforming Production Readiness
Panagiotis Moustafellos - Elastic
Energy Consumption of Datacenters
Thomas Fricke
Are We Really Engineers?
Hillel Wayne