Data Dictionary

Status: Active | Version: 1.0.0 | Last Updated: 2026-05-23

This document defines every significant data entity, field, and enum used across the ML Incident Response Platform — in the PostgreSQL schema, the API payloads, and the Prometheus metrics surface. Engineers and analysts should treat this as the canonical source of truth for field names, types, allowed values, and business-level definitions.


Entity: incidents

The core table. Each row represents a discrete operational event affecting one or more ML models in production.

| Column | Type | Nullable | Description |
| --- | --- | --- | --- |
| id | UUID | No | Primary key, generated by gen_random_uuid() |
| title | VARCHAR(500) | No | Human-readable summary. Populated by the engineer or automated alert. |
| severity | ENUM | No | See Severity Levels below. |
| status | ENUM | No | Current lifecycle state. See Incident Status. |
| model_id | UUID | No | Foreign key → models.id. The primary affected model. |
| description | TEXT | Yes | Free-form markdown body. Supports embedded links and code blocks. |
| trigger_source | ENUM | No | How the incident was created: manual, automated_drift, automated_performance, pagerduty. |
| labels | TEXT[] | No | Arbitrary string tags. Default: {}. Used for filtering and routing rules. |
| assigned_to | VARCHAR(255) | Yes | PagerDuty on-call schedule name or individual user slug. |
| runbook_url | TEXT | Yes | Link to the applicable runbook in runbooks/. |
| created_at | TIMESTAMPTZ | No | Set on insert, never updated. |
| updated_at | TIMESTAMPTZ | No | Updated on every write via trigger. |
| resolved_at | TIMESTAMPTZ | Yes | Populated when status transitions to resolved. |
| closed_at | TIMESTAMPTZ | Yes | Populated when status transitions to closed. |
| created_by | VARCHAR(255) | No | JWT sub claim of the user or service account that created the record. |

Severity Levels

| Value | Label | SLO: Time to Acknowledge | SLO: Time to Resolve | Typical Trigger |
| --- | --- | --- | --- | --- |
| p1 | Critical | 5 minutes | 1 hour | Production model completely unavailable or returning null predictions |
| p2 | High | 15 minutes | 4 hours | PSI > 0.25 on a revenue-critical model; error rate > 5% |
| p3 | Medium | 1 hour | 24 hours | PSI 0.15–0.25; performance degradation within acceptable range |
| p4 | Low | 4 hours | 72 hours | Minor drift detected; scheduled maintenance window needed |
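
The SLO targets above can be expressed as a lookup keyed by severity. The following is a minimal sketch, not platform code: `SLO_TARGETS` and `slo_breaches` are hypothetical names, and it assumes both SLO clocks start at `created_at`. The incidents table has no acknowledgement column, so `acknowledged_at` is an illustrative argument that might be derived from the timeline entry recording the move to `investigating`.

```python
from datetime import datetime, timedelta, timezone

# Targets copied from the Severity Levels table:
# severity -> (time to acknowledge, time to resolve).
SLO_TARGETS = {
    "p1": (timedelta(minutes=5), timedelta(hours=1)),
    "p2": (timedelta(minutes=15), timedelta(hours=4)),
    "p3": (timedelta(hours=1), timedelta(hours=24)),
    "p4": (timedelta(hours=4), timedelta(hours=72)),
}


def slo_breaches(severity, created_at, acknowledged_at, resolved_at, now):
    """Return which SLO targets ("acknowledge", "resolve") were missed.

    ASSUMPTION: both clocks start at created_at. A timestamp of None means
    the step has not happened yet, so elapsed time is measured up to `now`.
    """
    ack_target, resolve_target = SLO_TARGETS[severity.lower()]
    breaches = []
    if (acknowledged_at or now) - created_at > ack_target:
        breaches.append("acknowledge")
    if (resolved_at or now) - created_at > resolve_target:
        breaches.append("resolve")
    return breaches
```

For example, a `p2` incident acknowledged 20 minutes after creation and still unresolved one hour in would report only the acknowledge target as missed.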

Incident Status

Status values follow a strict state machine. See governance.md for the full transition diagram.

| Value | Description |
| --- | --- |
| open | Created but not yet acknowledged |
| investigating | Acknowledged; root cause analysis in progress |
| mitigated | Immediate impact contained; root cause may still be unknown |
| resolved | Root cause identified and fix confirmed in production |
| closed | Post-mortem completed and action items assigned |
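
The authoritative transition rules live in governance.md and are not reproduced here. As an illustration only, the sketch below assumes a linear, forward-only flow (open → investigating → mitigated → resolved → closed) and shows how transition validation could sit alongside the `resolved_at` and `closed_at` timestamps. `ALLOWED_TRANSITIONS` and `transition` are hypothetical names.

```python
from datetime import datetime, timezone

# ASSUMPTION: a linear, forward-only flow. The real rules are defined in
# governance.md and may permit skipped states or reopen paths.
ALLOWED_TRANSITIONS = {
    "open": {"investigating"},
    "investigating": {"mitigated"},
    "mitigated": {"resolved"},
    "resolved": {"closed"},
    "closed": set(),
}


def transition(incident, new_status, now=None):
    """Return a copy of `incident` (a dict) moved to `new_status`.

    Raises ValueError if the move is not listed in ALLOWED_TRANSITIONS.
    Sets resolved_at / closed_at as described in the incidents table.
    """
    now = now or datetime.now(timezone.utc)
    current = incident["status"]
    if new_status not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {new_status}")
    updated = dict(incident, status=new_status, updated_at=now)
    if new_status == "resolved":
        updated["resolved_at"] = now
    elif new_status == "closed":
        updated["closed_at"] = now
    return updated
```

Returning a copy rather than mutating in place keeps the previous state available, which is convenient when the same change also has to be written to the timeline as a `status_change` event.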

Entity: incident_timeline

Append-only audit log of all events that occurred during an incident's lifecycle.

| Column | Type | Nullable | Description |
| --- | --- | --- | --- |
| id | UUID | No | Primary key |
| incident_id | UUID | No | Foreign key → incidents.id |
| event_type | ENUM | No | note, status_change, escalation, runbook_step_completed, automated_alert |
| body | TEXT | No | Markdown content of the event |
| author | VARCHAR(255) | No | User slug or service account identifier |
| occurred_at | TIMESTAMPTZ | No | When the event occurred (may differ from created_at for retroactive entries) |
| created_at | TIMESTAMPTZ | No | When the record was inserted |

Entity: models

Registry of ML models monitored by the platform.

| Column | Type | Nullable | Description |
| --- | --- | --- | --- |
| id | UUID | No | Primary key |
| name | VARCHAR(255) | No | Human-readable model identifier, e.g. credit-risk-v3 |
| version | VARCHAR(50) | No | Semantic version or git SHA of the model artifact |
| stage | ENUM | No | staging, production, deprecated |
| team | VARCHAR(255) | No | Owning team slug. Used for alert routing. |
| feature_store_ref | TEXT | Yes | Path or URI to the feature store view used at training time |
| model_card_url | TEXT | Yes | Link to the model card document |
| registered_at | TIMESTAMPTZ | No | When the model was added to the registry |
| promoted_at | TIMESTAMPTZ | Yes | When the model was promoted to production |

Entity: drift_assessments

Stores the point-in-time results of each drift evaluation that the drift detection pipeline runs against a model.

| Column | Type | Nullable | Description |
| --- | --- | --- | --- |
| id | UUID | No | Primary key |
| model_id | UUID | No | Foreign key → models.id |
| assessed_at | TIMESTAMPTZ | No | Window end time of the assessment |
| psi_composite | FLOAT | No | Weighted composite PSI across all features. Threshold: 0.15 (warn), 0.25 (critical). |
| ks_p_value | FLOAT | No | KS-test p-value across the prediction distribution. Values < 0.05 indicate significant drift. |
| drift_detected | BOOLEAN | No | true if any individual feature PSI > threshold or KS p-value < 0.05 |
| feature_scores | JSONB | No | Map of feature_name → psi_score for per-feature drill-down |
| reference_window_start | TIMESTAMPTZ | No | Start of the training/baseline data window |
| reference_window_end | TIMESTAMPTZ | No | End of the training/baseline data window |
| evaluation_window_start | TIMESTAMPTZ | No | Start of the production data window compared against reference |
| evaluation_window_end | TIMESTAMPTZ | No | End of the production data window |
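
The threshold and flag rules described for `psi_composite` and `drift_detected` can be sketched as two small functions. This is an illustrative reading of the column descriptions, not the pipeline's actual code. `classify_psi` and `drift_detected` are hypothetical helpers, the boundary handling is an assumption, and the per-feature PSI threshold is left as a required argument because the description does not state which value applies.

```python
def classify_psi(psi_composite):
    """Map a composite PSI score to "ok", "warn", or "critical".

    The 0.15 and 0.25 thresholds come from the psi_composite description.
    ASSUMPTION: boundary handling (>= 0.15 warns, > 0.25 is critical)
    follows the PSI ranges quoted in the Severity Levels table.
    """
    if psi_composite > 0.25:
        return "critical"
    if psi_composite >= 0.15:
        return "warn"
    return "ok"


def drift_detected(feature_scores, ks_p_value, feature_psi_threshold):
    """Apply the drift_detected rule: any per-feature PSI above the
    threshold, or a KS-test p-value below 0.05.

    ASSUMPTION: the caller supplies the per-feature threshold; the column
    description does not say which value is used.
    """
    any_feature_drifted = any(
        score > feature_psi_threshold for score in feature_scores.values()
    )
    return any_feature_drifted or ks_p_value < 0.05
```

Here `feature_scores` has the same shape as the JSONB column: a mapping of feature name to PSI score, such as `{"age": 0.05, "income": 0.31}`.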

Prometheus Metrics

The platform exposes a /metrics endpoint (Prometheus text format) scraped every 30 seconds.

| Metric name | Type | Labels | Description |
| --- | --- | --- | --- |
| mlplatform_incidents_total | Counter | severity, trigger_source | Total incidents created since service start |
| mlplatform_incidents_open | Gauge | severity | Currently open incidents by severity |
| mlplatform_drift_psi_composite | Gauge | model_id, model_name | Latest composite PSI score per model |
| mlplatform_drift_detections_total | Counter | model_id, drift_detected | Total drift assessments run, partitioned by the drift_detected outcome |
| mlplatform_api_request_duration_seconds | Histogram | method, endpoint, status_code | API latency distribution |
| mlplatform_redis_denylist_size | Gauge | (none) | Current number of denylisted JWT JTIs in Redis |

Enum Reference

All enum values are stored as lowercase strings in PostgreSQL to avoid case-sensitivity bugs during ORM mapping.

| Enum name | Values |
| --- | --- |
| incident_severity | p1, p2, p3, p4 |
| incident_status | open, investigating, mitigated, resolved, closed |
| trigger_source | manual, automated_drift, automated_performance, pagerduty |
| model_stage | staging, production, deprecated |
| timeline_event_type | note, status_change, escalation, runbook_step_completed, automated_alert |
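
Application code can mirror these enums so that the lowercase stored strings round-trip without case mismatches. The classes below are a sketch using Python's standard `enum` module. The class names and the `parse_enum` helper are hypothetical and are not taken from the platform's codebase.

```python
from enum import Enum


class IncidentSeverity(str, Enum):
    P1 = "p1"
    P2 = "p2"
    P3 = "p3"
    P4 = "p4"


class IncidentStatus(str, Enum):
    OPEN = "open"
    INVESTIGATING = "investigating"
    MITIGATED = "mitigated"
    RESOLVED = "resolved"
    CLOSED = "closed"


class TriggerSource(str, Enum):
    MANUAL = "manual"
    AUTOMATED_DRIFT = "automated_drift"
    AUTOMATED_PERFORMANCE = "automated_performance"
    PAGERDUTY = "pagerduty"


class ModelStage(str, Enum):
    STAGING = "staging"
    PRODUCTION = "production"
    DEPRECATED = "deprecated"


class TimelineEventType(str, Enum):
    NOTE = "note"
    STATUS_CHANGE = "status_change"
    ESCALATION = "escalation"
    RUNBOOK_STEP_COMPLETED = "runbook_step_completed"
    AUTOMATED_ALERT = "automated_alert"


def parse_enum(enum_cls, raw):
    """Normalize case and surrounding whitespace, then look up the member.

    Raises ValueError if `raw` is not one of the enum's stored values.
    """
    return enum_cls(raw.strip().lower())
```

With this in place, `parse_enum(IncidentSeverity, "P1")` resolves to `IncidentSeverity.P1`, whose stored value is the lowercase `"p1"`.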