Multi-region SLO program
Defined error-budget-based SLOs across the platform and wired alerting to symptoms users feel, not noisy host metrics.
Availability 99.9% → 99.98%; page volume down 60%.
Site Reliability Engineer · resilience, observability & on-call sanity
SRE who believes reliability is a feature you design in, not a fire you fight later. I've run platforms through Black-Friday-scale spikes and the boring 2am degradations nobody tweets about. I measure success in pages not sent and runbooks nobody had to open.
Rebuilt the incident lifecycle with clearer alerts, tested runbooks, and blameless postmortems, and drilled it with game days.
MTTR 47min → 9min; three SPOFs eliminated.