Monitoring & Observability
Status: Active | Version: 1.0.0 | Last Updated: 2026-05-23
This document describes the full observability stack for the ML Incident Response Platform: what is instrumented, where metrics flow, how dashboards are organized, and what alert thresholds are configured and why.
Observability Stack Overview
The platform uses a three-pillar observability approach: metrics (Prometheus + Grafana), structured logs (JSON → stdout → Loki), and distributed traces (OpenTelemetry → Tempo). Each pillar is independently queryable but they share a unified correlation key: the trace_id emitted by the FastAPI middleware on every request and included in all log lines and spans.
The docker-compose.yml in the repo root spins up all three collection backends alongside the application. In production, each component maps to a managed equivalent (e.g., Grafana Cloud, Datadog, or an internal Prometheus federation).
Metrics
Scrape Configuration
Prometheus scrapes the /metrics endpoint on the FastAPI service every 30 seconds. The endpoint is internal-only (blocked by the middleware rate limiter for external callers) and returns standard Prometheus text exposition format.
All metrics are prefixed mlplatform_ to avoid collision in federated environments.
Key Metrics and Thresholds
Incident Metrics
mlplatform_incidents_open{severity="p1"}
Gauge. Any value > 0 for more than 5 minutes fires a P1 PagerDuty alert. P1 incidents should never remain unacknowledged.
mlplatform_incidents_open{severity="p2"}
Gauge. Alert fires if > 3 concurrent P2 incidents are open, indicating systemic issues rather than isolated model failures.
mlplatform_time_to_acknowledge_seconds
Histogram. Measured from incidents.created_at to the first status_change timeline event. The p95 target is ≤ 900 seconds (15 min) for P2 incidents.
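As a rough illustration of how a p95 is read off a histogram like mlplatform_time_to_acknowledge_seconds, the sketch below estimates a quantile from cumulative buckets by linear interpolation, which is the approach Prometheus's histogram_quantile() takes. The bucket boundaries and counts are made up for the example; the platform's real bucket layout is not stated in this document.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound; the last upper bound must be math.inf.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(upper):
                # The quantile lies in the +Inf bucket: report the highest
                # finite bound, since no upper limit is known.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical bucket layout (seconds) and cumulative counts.
example_buckets = [
    (60, 10), (300, 50), (600, 80), (900, 96), (1800, 100), (math.inf, 100),
]
p95 = histogram_quantile(0.95, example_buckets)
print(p95, "meets target" if p95 <= 900 else "misses target")
```

Because the estimate is interpolated, its accuracy depends on how finely the buckets are spaced around the 900-second target.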
Drift Metrics
mlplatform_drift_psi_composite{model_id, model_name}
Gauge. Updated after every drift assessment run. Grafana alert fires at > 0.15 (warning, P3 incident auto-created) and > 0.25 (critical, P2 incident auto-created).
mlplatform_drift_detections_total{drift_detected="true"}
Counter. Rate-of-change tracked over 24h windows. A sudden spike in drift detection rate across multiple models indicates a data pipeline issue upstream rather than model-specific degradation — this triggers a separate runbook (runbooks/pipeline-data-quality.md).
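To make the mlplatform_drift_psi_composite thresholds concrete, here is a minimal sketch of the standard Population Stability Index over binned proportions, followed by the warning/critical mapping described above. How the platform combines per-feature values into the composite is not specified in this document, so the `psi` helper and the `eps` smoothing are illustrative assumptions only.

```python
import math

WARNING_PSI = 0.15   # above this: warning, P3 incident auto-created
CRITICAL_PSI = 0.25  # above this: critical, P2 incident auto-created


def psi(expected, actual, eps=1e-6):
    """Standard PSI between two lists of bin proportions.

    `eps` (an assumption) guards against empty bins, where log() is undefined.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def classify_psi(value):
    """Map a PSI value to the incident severity it would auto-create."""
    if value > CRITICAL_PSI:
        return "p2"
    if value > WARNING_PSI:
        return "p3"
    return None


baseline = [0.25, 0.25, 0.25, 0.25]
current = [0.40, 0.30, 0.20, 0.10]
score = psi(baseline, current)
print(round(score, 4), classify_psi(score))
```

Both comparisons are strict, matching the "> 0.15" and "> 0.25" conditions in the text.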
API Health Metrics
mlplatform_api_request_duration_seconds{endpoint="/incidents", method="POST"}
Histogram. Alert fires if p99 latency exceeds 500ms for more than 2 minutes. Incident creation is on the critical path for the automated alerting pipeline.
HTTP error rate:
Derived metric: rate(mlplatform_api_request_duration_seconds_count{status_code=~"5.."}[5m]) / rate(mlplatform_api_request_duration_seconds_count[5m]). Alert at > 1% error rate.
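The derived error rate can be checked by hand. Since both rate() terms cover the same 5-minute window, their ratio equals the ratio of the counter increases over that window. The sketch below works through that arithmetic; the per-status-code counts are invented for the example.

```python
import re


def error_ratio(increase_by_status):
    """Fraction of requests in the window that returned a 5xx status.

    `increase_by_status` maps a status_code label value to the counter
    increase observed over the window.
    """
    total = sum(increase_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(
        n for code, n in increase_by_status.items()
        if re.fullmatch(r"5..", code)  # same pattern as status_code=~"5.."
    )
    return errors / total


# Hypothetical increases over one 5-minute window.
window = {"200": 970, "201": 12, "404": 6, "500": 9, "503": 3}
ratio = error_ratio(window)
print(f"{ratio:.1%}", "alert" if ratio > 0.01 else "ok")
```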
Logging
Format
All application logs are written as structured JSON to stdout. Each log line includes:
```json
{
  "timestamp": "2026-05-23T14:33:00.123Z",
  "level": "INFO",
  "logger": "mlplatform.incidents.router",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "message": "Incident created",
  "incident_id": "550e8400-e29b-41d4-a716-446655440000",
  "severity": "p2",
  "model_id": "3fa85f64-5717-4562-b3fc-2c963f66afa6",
  "user": "mlops-pipeline-sa"
}
```
PII policy: User-submitted free-text fields (description, body) are never logged in their entirety. They are replaced with [REDACTED:free_text] in log lines. This mirrors the PII handling pattern used in annotation QA, carried over from the project author's data operations background.
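One way to produce log lines of this shape with the standard library is sketched below. It is not the platform's actual logging setup: the `trace_id_var` context variable, the `fields` key passed through `extra`, and the `JsonFormatter` class are all hypothetical names, and span_id is omitted for brevity.

```python
import contextvars
import json
import logging
import sys
from datetime import datetime, timezone

# Hypothetical holder for the current request's trace ID; the real middleware
# may propagate it differently (for example via OpenTelemetry context).
trace_id_var = contextvars.ContextVar("trace_id", default=None)

# Free-text fields named in the PII policy.
FREE_TEXT_FIELDS = {"description", "body"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "trace_id": trace_id_var.get(),
            "message": record.getMessage(),
        }
        # Structured fields arrive via logger.info(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED:free_text]" if key in FREE_TEXT_FIELDS else value
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("mlplatform.incidents.router")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")
logger.info(
    "Incident created",
    extra={"fields": {"severity": "p2", "description": "customer text here"}},
)
```

Redacting inside the formatter means a call site cannot leak free text by accident, at the cost of the formatter needing to know the field names.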
Log Levels
| Level | When used |
|---|---|
| DEBUG | Detailed internal state; disabled in production by default (LOG_LEVEL=INFO) |
| INFO | Normal operational events: requests, state transitions, drift assessments |
| WARNING | Recoverable conditions: retry attempts, deprecated endpoint calls, near-threshold PSI |
| ERROR | Unrecoverable errors that returned 5xx to the caller |
| CRITICAL | Service-level failures: database connection loss, Redis unavailable |
Log Retention
Local dev: logs are not retained (ephemeral container). Staging: 14-day retention in Loki. Production: 90-day retention with cold storage archival to S3-compatible object store after 30 days.
Alerting
Alert Routing
Alerts flow from Prometheus Alertmanager → PagerDuty. Routing rules are based on the severity label on the firing alert:
| Alert severity | PagerDuty urgency | Notification channel |
|---|---|---|
| P1 | Critical (immediate phone call) | Primary on-call + secondary on-call + #ml-incidents Slack |
| P2 | High (push notification) | Primary on-call + #ml-incidents Slack |
| P3 | Low (email) | #ml-drift-alerts Slack |
| P4 | Info (weekly digest) | #ml-ops-digest Slack |
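The routing table can be read as a simple lookup keyed on the severity label. The sketch below expresses it in Python purely for illustration; the real routing lives in Alertmanager configuration, and the `route` function, the upper-casing of the label, and the error on unknown severities are assumptions.

```python
# Severity -> (PagerDuty urgency, notification targets), copied from the table.
ROUTES = {
    "P1": ("critical", ["primary on-call", "secondary on-call", "#ml-incidents"]),
    "P2": ("high", ["primary on-call", "#ml-incidents"]),
    "P3": ("low", ["#ml-drift-alerts"]),
    "P4": ("info", ["#ml-ops-digest"]),
}


def route(labels):
    """Return the urgency and targets for a firing alert's label set."""
    # Assumption: label values may be lower-case ("p1"), so normalize first.
    severity = labels.get("severity", "").upper()
    if severity not in ROUTES:
        # Assumption: an unlabeled alert is a configuration error, not a P4.
        raise ValueError(f"no route for severity {severity!r}")
    urgency, targets = ROUTES[severity]
    return {"urgency": urgency, "notify": targets}


print(route({"alertname": "DriftCritical", "severity": "p2"}))
```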
Defined Alerts
| Alert name | Condition | Severity | Auto-creates incident? |
|---|---|---|---|
| P1IncidentUnacknowledged | mlplatform_incidents_open{severity="p1"} > 0 for 5m | P1 | No (incident already exists) |
| DriftWarning | mlplatform_drift_psi_composite > 0.15 for 10m | P3 | Yes |
| DriftCritical | mlplatform_drift_psi_composite > 0.25 for 5m | P2 | Yes |
| APIHighErrorRate | Error rate > 1% for 2m | P2 | Yes |
| APIHighLatency | p99 POST /incidents > 500ms for 2m | P3 | Yes |
| RedisDenylistUnreachable | Redis health check fails for 1m | P1 | Yes |
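Each condition in the Defined Alerts table carries a "for" duration, meaning the condition has to hold continuously across evaluations before the alert fires; a single false evaluation resets the clock. The sketch below models that behaviour in isolation. `PendingAlert` is a hypothetical helper for illustration, not part of Prometheus or this codebase.

```python
class PendingAlert:
    """Track whether a condition has held long enough to fire."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self.pending_since = None  # time the condition first became true

    def evaluate(self, condition_true, now):
        """Record one evaluation at time `now` (seconds); return True if firing."""
        if not condition_true:
            # Any false evaluation clears the pending state.
            self.pending_since = None
            return False
        if self.pending_since is None:
            self.pending_since = now
        return now - self.pending_since >= self.hold_seconds


# DriftCritical: PSI > 0.25 for 5m (300 seconds).
alert = PendingAlert(hold_seconds=300)
samples = [(0, 0.27), (150, 0.28), (200, 0.20), (250, 0.30), (550, 0.31)]
for t, value in samples:
    print(t, alert.evaluate(value > 0.25, now=t))
```

In this run the dip at t=200 resets the timer, so the alert only fires at t=550, 300 seconds after the condition became true again at t=250.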
Dashboards
Grafana dashboards are stored as JSON in dashboards/. They are provisioned automatically on docker compose up.
| Dashboard | File | Purpose |
|---|---|---|
| ML Platform Overview | dashboards/overview.json | High-level KPIs: open incidents by severity, drift detection rate, API health |
| Drift Deep Dive | dashboards/drift.json | Per-model PSI trends, feature-level heatmap, assessment cadence |
| Incident Operations | dashboards/incidents.json | MTTA, MTTR, incident volume over time, severity distribution |
| API Performance | dashboards/api.json | Request rates, latency percentiles, error rates by endpoint |
Health Endpoints
| Endpoint | Purpose |
|---|---|
| GET /health | Returns 200 OK when the service is running. Does not check dependencies. Used by load balancer. |
| GET /health/ready | Returns 200 OK only when PostgreSQL and Redis are reachable. Used by Kubernetes readiness probe. |
| GET /health/live | Returns 200 OK as long as the process is alive. Used by Kubernetes liveness probe. |
| GET /metrics | Prometheus text exposition. Internal-only. |
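The difference between the liveness and readiness endpoints comes down to whether dependencies are consulted. The framework-free sketch below shows that split. The function names and the 503 status for a failed readiness check are assumptions; the document only states when 200 is returned.

```python
def liveness():
    """The process is up if this code runs at all; no dependencies checked."""
    return 200


def readiness(checks):
    """Return (status, failed_names) for a mapping of name -> check callable.

    Each check returns True when its dependency is reachable. A check that
    raises is treated as a failure rather than crashing the probe.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    # Assumption: 503 signals "not ready" so the pod is removed from rotation.
    return (200 if not failed else 503), failed


def redis_down():
    raise ConnectionError("redis unreachable")


print(readiness({"postgresql": lambda: True, "redis": redis_down}))
```

Keeping dependency checks out of the liveness path avoids restarting a healthy process just because PostgreSQL or Redis is briefly unavailable.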
Runbook: Monitoring System Self-Failure
If Prometheus itself becomes unavailable:
- Check docker compose ps to ensure the prometheus container is running
- Verify scrape target health at http://localhost:9090/targets
- If Alertmanager is unreachable, manually page the on-call engineer via the PagerDuty web UI
- Escalate to P1 if the monitoring outage exceeds 15 minutes; blind production operation is a P1 event by definition
For Redis failure specifically (impacts JWT denylist integrity): follow runbooks/redis-failure.md. A Redis outage means revoked tokens may be accepted again until Redis recovers; the mitigation is to temporarily reduce the JWT TTL to 60 seconds by hot-reloading the JWT_EXPIRE_SECONDS env var.