ML Incident Response Playbook

A production-grade FastAPI service for detecting, triaging, and resolving ML system incidents — model degradation, data quality failures, pipeline outages, and LLM cost spikes.

Built to demonstrate end-to-end ML operations engineering: hardened CI/CD, structured observability, async PostgreSQL persistence, Redis-backed JWT revocation, and Prometheus + OpenTelemetry instrumentation.

What this is

| Layer | Stack |
| --- | --- |
| API | FastAPI + Pydantic v2 + SQLAlchemy 2.x async |
| Auth | PyJWT RS256 + bcrypt + Redis denylist |
| Persistence | PostgreSQL (prod) / SQLite (dev/test) via Alembic |
| Observability | Prometheus + OpenTelemetry (OTLP/gRPC) + structlog |
| CI/CD | GitHub Actions: secrets scan, SAST, dep-audit, unit + integration tests, Trivy container scan, SBOM |
| Diagrams | Mermaid (auto-rendered to PNG in CI) |
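
The Auth layer pairs RS256 JWTs with a Redis denylist. As a rough illustration of how this kind of revocation usually works (not this project's actual code), the sketch below keeps revoked token IDs only until the token would have expired anyway, which is what a Redis key with a TTL gives you for free. `InMemoryDenylist`, `is_token_accepted`, and the `jti`/`exp` claim layout are hypothetical names chosen for the example; an in-memory dict stands in for Redis so the snippet runs without any services, and signature verification is assumed to have already happened.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InMemoryDenylist:
    """Hypothetical stand-in for a Redis denylist: token ID (jti) -> expiry time."""

    _expiries: Dict[str, float] = field(default_factory=dict)

    def revoke(self, jti: str, expires_at: float) -> None:
        # Remember the entry only until the token's own expiry,
        # mirroring a Redis key written with a TTL.
        self._expiries[jti] = expires_at

    def is_revoked(self, jti: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expires_at = self._expiries.get(jti)
        if expires_at is None:
            return False
        if expires_at <= now:
            # Lazy eviction: an expired token is rejected elsewhere,
            # so its denylist entry is no longer needed.
            del self._expiries[jti]
            return False
        return True


def is_token_accepted(
    claims: dict, denylist: InMemoryDenylist, now: Optional[float] = None
) -> bool:
    """Accept already-verified claims only if unexpired and not revoked."""
    now = time.time() if now is None else now
    if claims["exp"] <= now:
        return False
    return not denylist.is_revoked(claims["jti"], now)


if __name__ == "__main__":
    denylist = InMemoryDenylist()
    claims = {"jti": "token-1", "exp": time.time() + 60}
    print(is_token_accepted(claims, denylist))  # accepted before logout
    denylist.revoke(claims["jti"], claims["exp"])
    print(is_token_accepted(claims, denylist))  # rejected after logout
```

Tying each entry's lifetime to the token's `exp` keeps the denylist bounded: it never holds more entries than there are revoked tokens that have not yet expired.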

Quick start

```bash
git clone https://github.com/zrlopez/ml-incident-response-playbook.git
cd ml-incident-response-playbook
cp .env.example .env        # fill in secrets
docker compose up --build   # API at http://localhost:8000
```

See Setup and Deployment for full details.

Architecture — component diagram, layer descriptions, request lifecycle, tech decisions
API Reference — endpoint contracts, auth flows, error schemas
Monitoring — Prometheus metrics, alert rules, drift detection
Governance — data handling, PII policy, SLA definitions
Troubleshooting — on-call triage steps
Contributing — branch model, commit convention, PR checklist