Site Reliability Engineer · resilience, observability & on-call sanity · Berlin, DE

Sam Okoye

SRE who believes reliability is a feature you design in, not a fire you fight later. I've run platforms through Black-Friday-scale spikes and the boring 2am degradations nobody tweets about. I measure success in pages not sent and runbooks nobody had to open.

02

Selected work

Multi-region SLO program

Defined error-budget-based SLOs across the platform and wired alerting to symptoms users feel, not noisy host metrics.

Availability 99.9% → 99.98%; page volume down 60%.

MTTR teardown

Rebuilt the incident lifecycle — clearer alerts, tested runbooks, blameless postmortems — and drilled it with game days.

MTTR 47 min → 9 min; three SPOFs eliminated.

03

Experience

2020-01 to now
Senior Site Reliability Engineer
Helix Cloud
  • Owned SLOs for a multi-region platform serving 30M MAU; lifted availability from 99.9% to 99.98%.
  • Cut mean time to recovery from 47 to 9 minutes with better alerting, runbooks and blameless reviews.
  • Led chaos-engineering game days that surfaced and closed three latent single points of failure.
2016-07 to 2019-12
DevOps Engineer
Kettle Systems
  • Built the CI/CD platform and Terraform modules behind 40+ services' deploys.
  • Migrated the fleet to Kubernetes with zero customer-facing downtime.

04

Toolkit

SRE · Kubernetes · Terraform · Prometheus · Incident response · SLOs · CI/CD · Chaos engineering