USENIX SREcon24 Europe/Middle East/Africa

USENIX SREcon is a conference series organized by the USENIX Association, primarily focused on Site Reliability Engineering and related fields in large-scale, high-reliability, high-availability systems. It’s a gathering point for professionals working in SRE, systems engineering, software engineering, and infrastructure operations. SREcon brings together experts and practitioners to share knowledge, best practices, and insights into maintaining the reliability and performance of complex systems.

Many talks cover case studies from major tech companies, and this real-world insight is valuable for professionals facing similar issues. The organizers prioritize diversity and inclusion, offering scholarships, mentoring, and programming aimed at underrepresented groups in tech to encourage broad participation.

SREcon serves as a critical venue for networking, professional development, and sharing innovative solutions.

Convention Centre Dublin 1

My Takeaways

Companies aim to get more out of mature tools by extracting actionable insights from them, maximizing value and reducing costs, particularly in observability and incident response. Acting on those insights creates a feedback loop that improves both quality and velocity.

Data governance is a complex topic, whether you’re working with S3, a data warehouse, a data store, a database, big data or something else. In all cases, load balancing, horizontal scaling and distributed consensus are crucial to success.

Developer portals are becoming essential as the field evolves rapidly. Commercial solutions are racing to catch up with Backstage, the current leader; however, they are still not extensible or customizable enough to meet all needs.

eBPF has reached maturity, and its application is expanding into use cases I would not have anticipated.

Some engineers seem to have an unusual fascination with Slack, which I don’t share due to its poor UX.

Convention Centre Dublin 2

My Event

Dude, You Forgot the Feedback: How Your Open Loop Control Planes Are Causing Outages Laura de Vesine - Datadog
You Depend on Time, This Is How It Works and You Won’t Believe It Philip Rowlands - Jane Street
Workshop: Loadshedding and Isolation Using Envoy Proxy Laura Nolan & Niall Murphy - Stanza
Achieving Excellence: SLO Thresholds That Transform Service Quality Thiara Ortiz - Netflix
Selective Reliability Engineering: There Is No Single Source of Truth Elise Burke - Datadog
Why You’re (Probably) Doing Service Catalogs Wrong Lisa Karlin Curtis - incident.io
Exploring the Unintended Consequences of Automation in Software Courtney Nash - The VOID
Rock around the Clock (Synchronization): Improve Performance with High Precision Time! Lerna Ekmekcioglu - Clockwork Systems
Mnemonic Rules for Eponymous Laws or: There’s a Law for That! Peter Burkholder - U.S. Government
Lessons from Unix History Diomidis Spinellis - AUEB & TU Delft
Treat Your Code as a Crime Scene Adam Tornhill - CodeScene
From PIDs to Pods: The Life Cycle of an eBPF-Autoinstrumented Application Marc Tudurí - Grafana Labs
Scheduling at Scale: eBPF Schedulers with Sched_ext Daniel Hodges - Meta
Noisy Neighbors, through Networking René Treffer and Ben Kochie - Reddit
Taming Noisy Benchmark Results Using Change Point Detection Matt Fleming - Cloudflare
How a Single API Endpoint Saved Us 3000 CPU Lasse Hels - Maersk
Managing the Risk of Software Supply Chain Attacks Mark Hahn - Qualys
Synthetic Monitoring and E2E Testing: 2 Sides of the Same Coin Carly Richmond - Elastic
Re-Building Envoy in Rust Dawid Nowak - Huawei Ireland Research Lab
What About the Engineer's MTTR? Ian Duffy - Cloudsmith
How to SRE Anything to Work Smarter and Live Better Jennifer Petoff - Google
How SRE Can Help With Cost & Efficiency John Looney - Crusoe Energy
SRE for LLMs: What We Learned While Launching John Lunney - Google
Breaking Out of Our Hybrid Cloud Datastore EOL Chains Konstantinos Fardelas - Skroutz SA
The Voyager Spacecraft—These Are the Only Engineers on Earth Who Want To Maximize Latency Robert Barron - IBM
Rollout Monitoring at Scale: Reflections on Adopting Canarying in GCE Roberto Frenna - Google
9 SLIs; OH MY! Sal Furino - Bloomberg CRE
Opening the Box: Diagnosing Operating-System Task-Scheduler Behavior on Highly Multicore Machines Julia Lawall - Inria-Paris
Granular CPU Capacity Management at Scale with eBPF George Brighton and Cameron Howes - Goldman Sachs
Riot Games: Evolution of Observability at the Gaming Company Erick Moreira and Kirill Mikhailov - Riot Games
A Powerful Logs Management Solution We All Have and Use but We Underestimate: systemd-journal Costa Tsaousis - Netdata
Blast Radius Reduction for Large-Scale Distributed Systems Linhua Tang - Huawei Ireland Research Centre
Get Your Non-SREs Oncall Ready! JC van Winkel and Brad Lipinski - Google
Transforming Production Readiness Panagiotis Moustafellos - Elastic
Energy Consumption of Datacenters Thomas Fricke
Are We Really Engineers? Hillel Wayne

Engineering is ... what engineers do
