ML Incident Response Playbook

License: MIT

A production-grade FastAPI service for detecting, triaging, and resolving ML system incidents — model degradation, data quality failures, pipeline outages, and LLM cost spikes.

Built to demonstrate end-to-end ML operations engineering: hardened CI/CD, structured observability, async PostgreSQL persistence, Redis-backed JWT revocation, and Prometheus + OpenTelemetry instrumentation.


What this is

| Layer | Stack |
|---|---|
| API | FastAPI + Pydantic v2 + SQLAlchemy 2.x async |
| Auth | PyJWT RS256 + bcrypt + Redis denylist |
| Persistence | PostgreSQL (prod) / SQLite (dev/test) via Alembic |
| Observability | Prometheus + OpenTelemetry (OTLP/gRPC) + structlog |
| CI/CD | GitHub Actions: secrets scan, SAST, dep-audit, unit + integration tests, Trivy container scan, SBOM |
| Diagrams | Mermaid (auto-rendered to PNG in CI) |
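
To make the "Redis denylist" entry in the Auth layer concrete, here is a minimal sketch of how token revocation by denylist typically works. It is an illustration under stated assumptions, not this project's implementation: the class and method names are hypothetical, and an in-memory dict stands in for Redis so the example runs with the standard library alone. The idea is to store a revoked token's `jti` claim only until the token would have expired anyway, which is what a Redis key with a TTL would give you.

```python
import time
from typing import Dict, Optional


class InMemoryDenylist:
    """Hypothetical stand-in for a Redis denylist keyed by the JWT `jti` claim.

    In a real deployment the dict would be replaced by Redis keys with a TTL;
    that wiring is assumed here, not taken from the project.
    """

    def __init__(self) -> None:
        # Maps jti -> the token's own expiry (Unix seconds).
        self._expiry_by_jti: Dict[str, float] = {}

    def revoke(self, jti: str, token_exp: float, now: Optional[float] = None) -> None:
        """Denylist a token until its own expiry; skip tokens already expired."""
        now = time.time() if now is None else now
        if token_exp > now:
            self._expiry_by_jti[jti] = token_exp

    def is_revoked(self, jti: str, now: Optional[float] = None) -> bool:
        """Return True while the token is denylisted and not yet expired."""
        now = time.time() if now is None else now
        expiry = self._expiry_by_jti.get(jti)
        if expiry is None:
            return False
        if expiry <= now:
            # Mimic TTL eviction: the entry is no longer needed once the
            # token could not pass expiry validation anyway.
            del self._expiry_by_jti[jti]
            return False
        return True
```

A request handler would call `is_revoked` with the `jti` of an already signature-verified token and reject the request when it returns `True`. Bounding each entry by the token's expiry keeps the denylist from growing without limit.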

Quick start

```bash
git clone https://github.com/zrlopez/ml-incident-response-playbook.git
cd ml-incident-response-playbook
cp .env.example .env        # fill in secrets
docker compose up --build   # API at http://localhost:8000
```

See Setup and Deployment for full details.


  • Architecture — component diagram, layer descriptions, request lifecycle, tech decisions
  • API Reference — endpoint contracts, auth flows, error schemas
  • Monitoring — Prometheus metrics, alert rules, drift detection
  • Governance — data handling, PII policy, SLA definitions
  • Troubleshooting — on-call triage steps
  • Contributing — branch model, commit convention, PR checklist