Site Reliability Engineer · resilience, observability & on-call sanity · Berlin, DE

Sam Okoye

SRE who believes reliability is a feature you design in, not a fire you fight later. I've run platforms through Black-Friday-scale spikes and the boring 2am degradations nobody tweets about. I measure success in pages not sent and runbooks nobody had to open.

GitHub ↗ Blog ↗

Selected work

Multi-region SLO program

Defined error-budget-based SLOs across the platform and wired alerting to symptoms users feel, not noisy host metrics.

Availability 99.9% → 99.98%; page volume down 60%.

MTTR teardown

Rebuilt the incident lifecycle — clearer alerts, tested runbooks, blameless postmortems — and drilled it with game days.

MTTR 47min → 9min; three SPOFs eliminated.

Experience

2020-01 — NOW

Senior Site Reliability Engineer

Helix Cloud

Owned SLOs for a multi-region platform serving 30M MAU; lifted availability from 99.9% to 99.98%.
Cut mean time to recovery from 47 to 9 minutes with better alerting, runbooks and blameless reviews.
Led chaos-engineering game days that surfaced and closed three latent single points of failure.

2016-07 — 2019-12

DevOps Engineer

Kettle Systems

Built the CI/CD platform and Terraform modules used to deploy 40+ services.
Migrated the fleet to Kubernetes with zero customer-facing downtime.

Toolkit

SRE · Kubernetes · Terraform · Prometheus · Incident response · SLOs · CI/CD · Chaos engineering