ML Incident Response Playbook
A production-grade FastAPI service for detecting, triaging, and resolving ML system incidents — model degradation, data quality failures, pipeline outages, and LLM cost spikes.
Built to demonstrate end-to-end ML operations engineering: hardened CI/CD, structured observability, async PostgreSQL persistence, Redis-backed JWT revocation, and Prometheus + OpenTelemetry instrumentation.
What this is
| Layer | Stack |
|---|---|
| API | FastAPI + Pydantic v2 + SQLAlchemy 2.x async |
| Auth | PyJWT RS256 + bcrypt + Redis denylist |
| Persistence | PostgreSQL (prod) / SQLite (dev/test) via Alembic |
| Observability | Prometheus + OpenTelemetry (OTLP/gRPC) + structlog |
| CI/CD | GitHub Actions — secrets scan, SAST, dep-audit, unit + integration tests, Trivy container scan, SBOM |
| Diagrams | Mermaid (auto-rendered to PNG in CI) |
Quick start
```bash
git clone https://github.com/zrlopez/ml-incident-response-playbook.git
cd ml-incident-response-playbook
cp .env.example .env          # fill in secrets
docker compose up --build     # API at http://localhost:8000
```
See Setup and Deployment for full details.
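Once `docker compose up` is running, a small readiness probe can confirm that the API answers. The sketch below is illustrative and not part of the repository: `wait_until_ready` is a hypothetical helper, and probing `/docs` assumes FastAPI's default interactive-docs route has not been disabled or moved.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses.

    Hypothetical helper for local smoke checks; not part of the project.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Container still starting, port not yet bound, or a non-2xx reply.
            pass
        time.sleep(interval_s)
    return False
```

Calling `wait_until_ready("http://localhost:8000/docs")` returns `True` as soon as the service responds with HTTP 200 and `False` if it stays unreachable for the whole timeout. If the project exposes a dedicated health endpoint, that URL would be a better target than `/docs`.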
Navigation
- Architecture — component diagram, layer descriptions, request lifecycle, tech decisions
- API Reference — endpoint contracts, auth flows, error schemas
- Monitoring — Prometheus metrics, alert rules, drift detection
- Governance — data handling, PII policy, SLA definitions
- Troubleshooting — on-call triage steps
- Contributing — branch model, commit convention, PR checklist