Data Dictionary
Status: Active | Version: 1.0.0 | Last Updated: 2026-05-23
This document defines every significant data entity, field, and enum used across the ML Incident Response Platform: the PostgreSQL schema, the API payloads, and the Prometheus metrics surface. Engineers and analysts should treat it as the canonical source of truth for field names, types, allowed values, and business-level definitions.
Entity: incidents
The core table. Each row represents a discrete operational event affecting one or more ML models in production.
| Column | Type | Nullable | Description |
|---|---|---|---|
| `id` | UUID | No | Primary key, generated by `gen_random_uuid()`. |
| `title` | VARCHAR(500) | No | Human-readable summary. Populated by the engineer or automated alert. |
| `severity` | ENUM | No | See Severity Levels below. |
| `status` | ENUM | No | Current lifecycle state. See Incident Status. |
| `model_id` | UUID | No | Foreign key → `models.id`. The primary affected model. |
| `description` | TEXT | Yes | Free-form markdown body. Supports embedded links and code blocks. |
| `trigger_source` | ENUM | No | How the incident was created: `manual`, `automated_drift`, `automated_performance`, `pagerduty`. |
| `labels` | TEXT[] | No | Arbitrary string tags. Default: `{}`. Used for filtering and routing rules. |
| `assigned_to` | VARCHAR(255) | Yes | PagerDuty on-call schedule name or individual user slug. |
| `runbook_url` | TEXT | Yes | Link to the applicable runbook in `runbooks/`. |
| `created_at` | TIMESTAMPTZ | No | Set on insert, never updated. |
| `updated_at` | TIMESTAMPTZ | No | Updated on every write via trigger. |
| `resolved_at` | TIMESTAMPTZ | Yes | Populated when `status` transitions to `resolved`. |
| `closed_at` | TIMESTAMPTZ | Yes | Populated when `status` transitions to `closed`. |
| `created_by` | VARCHAR(255) | No | JWT `sub` claim of the user or service account that created the record. |
Severity Levels
| Value | Label | SLO — Time to Acknowledge | SLO — Time to Resolve | Typical Trigger |
|---|---|---|---|---|
| `p1` | Critical | 5 minutes | 1 hour | Production model completely unavailable or returning null predictions |
| `p2` | High | 15 minutes | 4 hours | PSI > 0.25 on a revenue-critical model; error rate > 5% |
| `p3` | Medium | 1 hour | 24 hours | PSI 0.15–0.25; performance degradation within acceptable range |
| `p4` | Low | 4 hours | 72 hours | Minor drift detected; scheduled maintenance window needed |
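
The SLO targets above can be turned into concrete deadlines for a given incident. The following is a minimal sketch, not the platform's implementation: the names `SLO_TARGETS` and `slo_deadlines` are hypothetical, and it assumes both SLO clocks start at `created_at`, which this document does not state. Severity keys are lowercased to match the Enum Reference.

```python
from datetime import datetime, timedelta, timezone

# (time to acknowledge, time to resolve) per severity, copied from the table above.
SLO_TARGETS = {
    "p1": (timedelta(minutes=5), timedelta(hours=1)),
    "p2": (timedelta(minutes=15), timedelta(hours=4)),
    "p3": (timedelta(hours=1), timedelta(hours=24)),
    "p4": (timedelta(hours=4), timedelta(hours=72)),
}


def slo_deadlines(severity: str, created_at: datetime) -> tuple[datetime, datetime]:
    """Return (acknowledge_by, resolve_by) for an incident.

    Assumption: both SLO clocks start at created_at.
    """
    ack, resolve = SLO_TARGETS[severity.lower()]
    return created_at + ack, created_at + resolve


# Example: a p2 incident created at 12:00 UTC.
created = datetime(2026, 5, 23, 12, 0, tzinfo=timezone.utc)
acknowledge_by, resolve_by = slo_deadlines("p2", created)
```

Under these assumptions the example incident must be acknowledged by 12:15 UTC and resolved by 16:00 UTC.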
Incident Status
Status values follow a strict state machine. See governance.md for the full transition diagram.
| Value | Description |
|---|---|
| `open` | Created but not yet acknowledged |
| `investigating` | Acknowledged; root cause analysis in progress |
| `mitigated` | Immediate impact contained; root cause may still be unknown |
| `resolved` | Root cause identified and fix confirmed in production |
| `closed` | Post-mortem completed and action items assigned |
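
The `incidents` table records `resolved_at` and `closed_at` when the status reaches `resolved` and `closed`. The sketch below shows one way application code could stamp those fields on a status change. `IncidentState` and `apply_status` are hypothetical names, and the function deliberately does not check whether a transition is legal, because those rules live in governance.md and are not reproduced here.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import Optional

STATUSES = ("open", "investigating", "mitigated", "resolved", "closed")


@dataclass(frozen=True)
class IncidentState:
    status: str = "open"
    resolved_at: Optional[datetime] = None
    closed_at: Optional[datetime] = None


def apply_status(state: IncidentState, new_status: str, now: datetime) -> IncidentState:
    """Return a copy of state with the new status and lifecycle timestamps set.

    Transition legality is NOT checked here; governance.md defines those rules.
    """
    if new_status not in STATUSES:
        raise ValueError(f"unknown incident status: {new_status!r}")
    changes = {"status": new_status}
    if new_status == "resolved":
        changes["resolved_at"] = now
    elif new_status == "closed":
        changes["closed_at"] = now
    return replace(state, **changes)
```

Returning a new frozen object rather than mutating in place keeps earlier states available, which suits an audit-oriented workflow; a real service might instead do this in a database trigger.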
Entity: incident_timeline
Append-only audit log of all events that occurred during an incident's lifecycle.
| Column | Type | Nullable | Description |
|---|---|---|---|
| `id` | UUID | No | Primary key |
| `incident_id` | UUID | No | Foreign key → `incidents.id` |
| `event_type` | ENUM | No | `note`, `status_change`, `escalation`, `runbook_step_completed`, `automated_alert` |
| `body` | TEXT | No | Markdown content of the event |
| `author` | VARCHAR(255) | No | User slug or service account identifier |
| `occurred_at` | TIMESTAMPTZ | No | When the event occurred (may differ from `created_at` for retroactive entries) |
| `created_at` | TIMESTAMPTZ | No | When the record was inserted |
Entity: models
Registry of ML models monitored by the platform.
| Column | Type | Nullable | Description |
|---|---|---|---|
| `id` | UUID | No | Primary key |
| `name` | VARCHAR(255) | No | Human-readable model identifier, e.g. `credit-risk-v3` |
| `version` | VARCHAR(50) | No | Semantic version or git SHA of the model artifact |
| `stage` | ENUM | No | `staging`, `production`, `deprecated` |
| `team` | VARCHAR(255) | No | Owning team slug. Used for alert routing. |
| `feature_store_ref` | TEXT | Yes | Path or URI to the feature store view used at training time |
| `model_card_url` | TEXT | Yes | Link to the model card document |
| `registered_at` | TIMESTAMPTZ | No | When the model was added to the registry |
| `promoted_at` | TIMESTAMPTZ | Yes | When the model was promoted to production |
Entity: drift_assessments
Stores point-in-time drift evaluation results. The drift detection pipeline writes one row per model per assessment run.
| Column | Type | Nullable | Description |
|---|---|---|---|
| `id` | UUID | No | Primary key |
| `model_id` | UUID | No | Foreign key → `models.id` |
| `assessed_at` | TIMESTAMPTZ | No | Window end time of the assessment |
| `psi_composite` | FLOAT | No | Weighted composite PSI across all features. Threshold: 0.15 (warn), 0.25 (critical). |
| `ks_p_value` | FLOAT | No | KS-test p-value across the prediction distribution. Values < 0.05 indicate significant drift. |
| `drift_detected` | BOOLEAN | No | `true` if any individual feature PSI > threshold or KS p-value < 0.05 |
| `feature_scores` | JSONB | No | Map of `feature_name` → `psi_score` for per-feature drill-down |
| `reference_window_start` | TIMESTAMPTZ | No | Start of the training/baseline data window |
| `reference_window_end` | TIMESTAMPTZ | No | End of the training/baseline data window |
| `evaluation_window_start` | TIMESTAMPTZ | No | Start of the production data window compared against reference |
| `evaluation_window_end` | TIMESTAMPTZ | No | End of the production data window |
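
To make the `feature_scores` and `drift_detected` columns concrete, here is a small sketch using the standard Population Stability Index formula over pre-binned proportions. It is illustrative only: the function names are hypothetical, the `eps` floor for empty bins is an assumption of this sketch, and the per-feature threshold defaults to the 0.15 warn level because this document does not say which threshold the pipeline applies to individual features.

```python
import math

PSI_WARN = 0.15      # warn threshold from the psi_composite row
PSI_CRITICAL = 0.25  # critical threshold from the psi_composite row
KS_ALPHA = 0.05      # KS p-value cutoff from the ks_p_value row


def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    Inputs are per-bin proportions over the same bins (reference window
    vs. evaluation window). eps guards against log(0) for empty bins.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def drift_detected(feature_scores: dict[str, float], ks_p_value: float,
                   psi_threshold: float = PSI_WARN) -> bool:
    """True if any feature PSI exceeds psi_threshold or the KS p-value is below 0.05.

    Assumption: the default threshold is the warn level; the pipeline may differ.
    """
    any_feature_over = any(score > psi_threshold for score in feature_scores.values())
    return any_feature_over or ks_p_value < KS_ALPHA


# Example: per-feature scores as they might appear in the feature_scores JSONB map.
scores = {"age": psi([0.5, 0.5], [0.5, 0.5]), "income": psi([0.5, 0.5], [0.6, 0.4])}
flag = drift_detected(scores, ks_p_value=0.40)
```

The weighting used for `psi_composite` is not specified in this document, so the sketch stops at per-feature scores.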
Prometheus Metrics
The platform exposes a /metrics endpoint (Prometheus text format) scraped every 30 seconds.
| Metric name | Type | Labels | Description |
|---|---|---|---|
| `mlplatform_incidents_total` | Counter | `severity`, `trigger_source` | Total incidents created since service start |
| `mlplatform_incidents_open` | Gauge | `severity` | Currently open incidents by severity |
| `mlplatform_drift_psi_composite` | Gauge | `model_id`, `model_name` | Latest composite PSI score per model |
| `mlplatform_drift_detections_total` | Counter | `model_id`, `drift_detected` | Total drift assessments run, partitioned by the `drift_detected` outcome |
| `mlplatform_api_request_duration_seconds` | Histogram | `method`, `endpoint`, `status_code` | API latency distribution |
| `mlplatform_redis_denylist_size` | Gauge | none | Current number of denylisted JWT JTIs in Redis |
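
For readers unfamiliar with the Prometheus text format, the sketch below renders individual sample lines the way they would appear on `/metrics`. It is a hand-rolled illustration with hypothetical helper names and made-up values; a real service would normally rely on a Prometheus client library rather than formatting lines itself.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample line: metric_name{label="value",...} sample_value."""
    if labels:
        body = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels.items())
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


# Illustrative values only; the model_id here is a placeholder, not a real UUID.
psi_line = render_sample(
    "mlplatform_drift_psi_composite",
    {"model_id": "example-model-id", "model_name": "credit-risk-v3"},
    0.18,
)
denylist_line = render_sample("mlplatform_redis_denylist_size", {}, 42)
```

Metrics without labels, such as `mlplatform_redis_denylist_size`, are rendered as a bare name followed by the value.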
Enum Reference
All enum values are stored as lowercase strings in PostgreSQL to avoid case-sensitivity bugs during ORM mapping.
| Enum name | Values |
|---|---|
| `incident_severity` | `p1`, `p2`, `p3`, `p4` |
| `incident_status` | `open`, `investigating`, `mitigated`, `resolved`, `closed` |
| `trigger_source` | `manual`, `automated_drift`, `automated_performance`, `pagerduty` |
| `model_stage` | `staging`, `production`, `deprecated` |
| `timeline_event_type` | `note`, `status_change`, `escalation`, `runbook_step_completed`, `automated_alert` |
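
On the application side, the enums above could be mirrored as Python `Enum` classes whose values are the lowercase stored strings. This is a sketch under that assumption: the class names and the `parse_enum` helper are hypothetical and are not taken from the platform's codebase.

```python
from enum import Enum


class IncidentSeverity(str, Enum):
    P1 = "p1"
    P2 = "p2"
    P3 = "p3"
    P4 = "p4"


class IncidentStatus(str, Enum):
    OPEN = "open"
    INVESTIGATING = "investigating"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"
    CLOSED = "closed"


class TriggerSource(str, Enum):
    MANUAL = "manual"
    AUTOMATED_DRIFT = "automated_drift"
    AUTOMATED_PERFORMANCE = "automated_performance"
    PAGERDUTY = "pagerduty"


class ModelStage(str, Enum):
    STAGING = "staging"
    PRODUCTION = "production"
    DEPRECATED = "deprecated"


class TimelineEventType(str, Enum):
    NOTE = "note"
    STATUS_CHANGE = "status_change"
    ESCALATION = "escalation"
    RUNBOOK_STEP_COMPLETED = "runbook_step_completed"
    AUTOMATED_ALERT = "automated_alert"


def parse_enum(enum_cls, raw: str):
    """Normalize case and whitespace before lookup so 'P1' and 'p1' map to the same member.

    Raises ValueError for strings that are not valid values of enum_cls.
    """
    return enum_cls(raw.strip().lower())
```

Normalizing at the boundary (API input, alert payloads) means everything written to PostgreSQL is already lowercase, which is the property the note above relies on.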