ML Incident Response Platform — Governance Policy
Document ID: GOV-001
Version: 2.0.0
Owner: ML Platform Governance Board
Last Updated: 2026-05-23
Review Cycle: Quarterly (next review: 2026-08-23)
Classification: Internal — Operational
1. Purpose and Scope
This policy governs the operational use, change management, incident accountability, and compliance controls for the ML Incident Response Platform. It applies to all personnel with read or write access to production ML systems, incident records, and the operational runbook system.
Deviations from this policy require written approval from the Platform Governance Board and must be documented as exceptions in the active incident ticket.
2. RACI Matrix
| Activity | Incident Commander | ML Platform SRE | ML Engineer | Data Engineer | Platform Governance |
|---|---|---|---|---|---|
| Declare SEV-1 / SEV-2 | A | R | I | I | I |
| Execute runbook | I | R | C | C | I |
| Authorise rollback | A | R | C | I | I |
| Stakeholder communications | R | C | I | I | I |
| Postmortem facilitation | C | R | C | C | I |
| Runbook change approval | I | C | C | I | A/R |
| Compliance and audit reporting | I | C | I | I | R |
| Data classification decisions | I | I | C | R | A |
| Break-glass production change | A | R | I | I | I |
R = Responsible | A = Accountable | C = Consulted | I = Informed
3. Severity and Escalation Policy
SEV-1 — Production Crisis
- Criteria: Full outage, > 10% user impact, SLO breach > 30 minutes, or confirmed revenue impact
- Acknowledgment SLA: ≤ 5 minutes
- Mitigation SLA: ≤ 30 minutes
- Resolution SLA: ≤ 4 hours
- Escalation: ML Platform Lead + VP Engineering at T+30 min if not mitigated
- Communication: #ml-incidents-sev1; stakeholder update every 15 minutes
- Postmortem: Required within 48 hours, blameless format
SEV-2 — Major Degradation
- Criteria: Partial outage; SLO degraded but not fully breached; workaround available
- Acknowledgment SLA: ≤ 15 minutes
- Mitigation SLA: ≤ 2 hours
- Resolution SLA: ≤ 24 hours
- Escalation: ML Platform Lead at T+2 h if not mitigated
- Postmortem: Required within 1 week
SEV-3 — Limited Impact
- Criteria: Single team or system affected; no SLO breach; workaround exists
- Acknowledgment SLA: ≤ 1 hour
- Mitigation SLA: ≤ 8 hours
- Resolution SLA: ≤ 72 hours
- Escalation: SRE Lead at T+8 h if not mitigated
- Postmortem: Recommended for recurring patterns
SEV-4 — Low Impact
- Criteria: No user impact; minor degradation; caught by automated monitoring
- Acknowledgment SLA: ≤ 4 hours
- Resolution SLA: Next sprint
- Postmortem: Not required
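The acknowledgment, mitigation, and escalation rules above can be encoded as data so that tooling applies them consistently. The following is a minimal sketch, not the platform's implementation: `SeverityPolicy`, `POLICIES`, and `escalation_target` are hypothetical names, and only the SLA values and escalation contacts come from this policy.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class SeverityPolicy:
    """SLA values for one severity level (names are illustrative)."""
    ack_sla: timedelta
    mitigation_sla: Optional[timedelta]  # SEV-4 defines no mitigation SLA
    escalate_to: Optional[str]           # SEV-4 defines no escalation contact


# Values transcribed from the severity definitions in this section.
POLICIES = {
    "SEV-1": SeverityPolicy(timedelta(minutes=5), timedelta(minutes=30),
                            "ML Platform Lead + VP Engineering"),
    "SEV-2": SeverityPolicy(timedelta(minutes=15), timedelta(hours=2),
                            "ML Platform Lead"),
    "SEV-3": SeverityPolicy(timedelta(hours=1), timedelta(hours=8),
                            "SRE Lead"),
    "SEV-4": SeverityPolicy(timedelta(hours=4), None, None),
}


def escalation_target(severity: str, elapsed: timedelta,
                      mitigated: bool) -> Optional[str]:
    """Return who to escalate to, or None if no escalation is due yet."""
    policy = POLICIES[severity]
    if mitigated or policy.mitigation_sla is None:
        return None
    # Escalation triggers once the mitigation SLA has elapsed (e.g. T+30 min).
    if elapsed >= policy.mitigation_sla:
        return policy.escalate_to
    return None
```

For example, `escalation_target("SEV-1", timedelta(minutes=30), mitigated=False)` returns the SEV-1 escalation contacts, while the same call one minute earlier returns `None`.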
4. SLO Reference Table
| SLO ID | Service | Metric | Target | Measurement Window |
|---|---|---|---|---|
| ML-SLO-001 | Incident API | HTTP 5xx error rate | < 1% | 28-day rolling |
| ML-SLO-002 | Incident API | P99 response latency | < 2 seconds | 28-day rolling |
| ML-SLO-003 | ML Models | P95 prediction accuracy | ≥ 92% | 7-day rolling |
| ML-SLO-004 | ETL Pipelines | Run completion within 2 h | ≥ 99% of scheduled runs | Weekly |
| ML-SLO-005 | Drift Detection | Alert-to-acknowledgment for SEV-1 | 100% within 5 minutes | Monthly |
Error Budget Policy: If the remaining error budget for any SLO falls below 20% of its monthly allowance, all non-critical deployments are frozen until the budget is restored or the freeze is explicitly waived by the Platform Governance Board.
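As an illustration of the freeze rule, the sketch below computes the remaining budget for a rate-based SLO such as ML-SLO-001 and applies the 20% threshold. The function names and the event-count inputs are assumptions; how the platform actually measures its budgets is not specified here.

```python
def remaining_budget_fraction(total_events: int, bad_events: int,
                              allowed_bad_rate: float) -> float:
    """Fraction of the error budget still unspent.

    allowed_bad_rate is the SLO's tolerated failure rate, e.g. 0.01 for a
    "< 1% 5xx" target. The result goes negative once the budget is overspent.
    """
    if not 0 < allowed_bad_rate <= 1:
        raise ValueError("allowed_bad_rate must be in (0, 1]")
    if total_events <= 0:
        return 1.0  # no traffic in the window, so nothing has been spent
    allowance = total_events * allowed_bad_rate
    return 1.0 - bad_events / allowance


def deployments_frozen(remaining_fraction: float, waived: bool = False,
                       threshold: float = 0.20) -> bool:
    """Apply the freeze rule: frozen below the threshold unless waived."""
    return remaining_fraction < threshold and not waived
```

With 1,000,000 requests and a 1% allowance, 8,500 failed requests leave roughly 15% of the budget, so `deployments_frozen` returns `True` unless a waiver is recorded.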
5. Change Management Policy
5.1 Runbook and Diagram Changes
All changes to runbooks/*.md or diagrams/*.mmd require:
- Pull request with at least one @ml-platform-sre review approval
- No self-merges (enforced via .github/CODEOWNERS)
- PR description must state: motivation, risk, and test scenario
- Version bump in the runbook front-matter header (version: field)
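The version-bump requirement lends itself to an automated check in CI. The sketch below is one possible approach and makes two assumptions not stated in this policy: that the `version:` field holds a three-part `MAJOR.MINOR.PATCH` value, and that comparing the old and new file contents is sufficient. `front_matter_version` and `is_version_bumped` are hypothetical helpers.

```python
import re
from typing import Tuple

# Matches a line such as "version: 1.4.2" (optionally quoted).
# Assumption: three-part numeric versions.
VERSION_RE = re.compile(
    r"^version:\s*['\"]?(\d+)\.(\d+)\.(\d+)['\"]?\s*$", re.MULTILINE
)


def front_matter_version(text: str) -> Tuple[int, int, int]:
    """Extract the first version: field found in the file text."""
    match = VERSION_RE.search(text)
    if match is None:
        raise ValueError("no version: field found")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def is_version_bumped(old_text: str, new_text: str) -> bool:
    """True if the new file's version is strictly greater than the old one."""
    return front_matter_version(new_text) > front_matter_version(old_text)
```

Note that the regular expression scans the whole file rather than only the front-matter block; a stricter check would first isolate the header.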
5.2 Infrastructure and API Changes
All changes to infrastructure/, ci_cd/, or api/ require:
- Two-reviewer approval (enforced via CODEOWNERS)
- Successful CI pipeline including all security gates
- Staging deployment and smoke-test validation before production promotion
- Rollback plan documented in the PR description
5.3 Emergency Break-Glass Changes
For SEV-1 incidents requiring immediate production changes:
- Single on-call SRE approval is permitted
- The change must be tracked in the active incident ticket with justification
- Full retrospective review is required in the postmortem
- The change must be re-applied through the normal process within 72 hours
- All break-glass events are reported to the Platform Governance Board
6. Data Governance and Classification
| Classification | Definition | Handling Requirements |
|---|---|---|
| PUBLIC | Architecture docs, runbook structure, diagram templates | No restrictions |
| INTERNAL | Incident logs, metric baselines, operational configs | Internal access only |
| CONFIDENTIAL | PII-adjacent data in logs, non-secret auth headers | Encrypted at rest and in transit; PII scrubbed before log emission |
| RESTRICTED | Private keys, DB credentials, JWT secrets, API tokens | Vault or K8s Secrets only; never in code, configs, or logs |
PII Handling Rule: All application log events MUST pass through the PII scrubbing
processor chain in observability/logging_config.py before emission.
Incident records must not contain raw user identifiers, email addresses, or IP addresses.
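To make the scrubbing requirement concrete, here is a minimal sketch of a single processor using only the standard library. It is not the contents of observability/logging_config.py: the `(logger, method_name, event_dict)` signature assumes a structlog-style processor chain, and the key names and regular expressions are illustrative and deliberately simple.

```python
import re
from typing import Any, Dict

# Illustrative patterns; real deployments usually need broader coverage
# (IPv6, phone numbers, nested payloads, and so on).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

# Hypothetical field names whose values are always redacted.
SENSITIVE_KEYS = {"user_id", "email", "ip", "ip_address"}


def scrub_pii(logger: Any, method_name: str,
              event_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event with identifiers, emails, and IPs masked."""
    scrubbed: Dict[str, Any] = {}
    for key, value in event_dict.items():
        if key.lower() in SENSITIVE_KEYS:
            scrubbed[key] = "[REDACTED]"
        elif isinstance(value, str):
            value = EMAIL_RE.sub("[EMAIL]", value)
            scrubbed[key] = IPV4_RE.sub("[IP]", value)
        else:
            scrubbed[key] = value  # non-string values pass through unchanged
    return scrubbed
```

This sketch only inspects top-level string values; nested dictionaries and lists would need a recursive walk before the rule above could be considered satisfied.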
LLM Input/Output Rule: No CONFIDENTIAL or RESTRICTED data may be sent to external LLM providers (GPT, Claude, Gemini, etc.) without explicit Data Processing Agreement (DPA) approval from Legal.
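The LLM rule can be enforced as a guard in front of any outbound provider call. The sketch below is one way to express it; `Classification` mirrors the four levels in the table above, while `may_send_to_external_llm` and the `dpa_approved` flag are hypothetical and stand in for however Legal's approval is actually recorded.

```python
from enum import IntEnum


class Classification(IntEnum):
    """Data classes from the table above, ordered by sensitivity."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def may_send_to_external_llm(classification: Classification,
                             dpa_approved: bool = False) -> bool:
    """CONFIDENTIAL and RESTRICTED data need explicit DPA approval."""
    if classification >= Classification.CONFIDENTIAL:
        return dpa_approved
    return True
```

Defaulting `dpa_approved` to `False` means a caller that forgets to check approval is denied rather than allowed.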
7. Audit and Compliance Controls
7.1 Audit Log Requirements
All audit events (incident creation, status changes, resolution, access control
decisions) must include:
- Actor identity (sub from JWT, or service account name)
- Timestamp (UTC ISO-8601)
- Action type (e.g. incident.created, incident.status_updated)
- Affected resource ID
- log_type: "audit" field for SIEM routing
Audit logs must be:
- Retained for a minimum of 90 days (configurable via LOG_RETENTION_DAYS)
- Forwarded to an immutable sink (CloudWatch Logs exported to S3 with Object Lock, Splunk, or Elastic)
- Accessible to the compliance team on request
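A small builder function can guarantee that every audit event carries the required fields. The following is a sketch under assumptions: the key names `actor`, `timestamp`, `action`, and `resource_id` are illustrative, and only `log_type: "audit"` and the example action types are taken from this policy.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

REQUIRED_FIELDS = ("actor", "timestamp", "action", "resource_id", "log_type")


def build_audit_event(actor: str, action: str, resource_id: str,
                      now: Optional[datetime] = None) -> Dict[str, str]:
    """Build an audit record with a UTC ISO-8601 timestamp.

    `now` should be timezone-aware; a naive datetime would be interpreted
    as local time by astimezone().
    """
    moment = now if now is not None else datetime.now(timezone.utc)
    event = {
        "actor": actor,              # JWT `sub` or service account name
        "timestamp": moment.astimezone(timezone.utc).isoformat(),
        "action": action,            # e.g. "incident.created"
        "resource_id": resource_id,
        "log_type": "audit",         # used for SIEM routing
    }
    missing = [name for name in REQUIRED_FIELDS if not event[name]]
    if missing:
        raise ValueError(f"audit event missing fields: {missing}")
    return event
```

Rejecting empty values at construction time keeps incomplete records from reaching the immutable sink, where they could not be corrected afterwards.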
7.2 SOC 2 Type II Control Mapping
| SOC 2 Control | Implementation in This Repository |
|---|---|
| CC6.1 — Logical access controls | JWT RBAC in api/app.py; @require_role decorator |
| CC6.2 — Authentication | OAuth2 password flow; bcrypt hashing via passlib |
| CC6.8 — Vulnerability prevention | pip-audit + Bandit + Trivy in CI pipeline |
| CC7.2 — Anomaly monitoring | observability/anomaly_detection.py; monitoring/alert_rules.yml |
| CC7.4 — Incident response | This runbook system; severity_matrix.md |
| CC8.1 — Change management | Branch protection + CODEOWNERS + CI security gates |
7.3 GDPR Article 22 Obligations
Where ML models make or assist automated decisions that materially affect individuals:
- Model version, feature inputs, and decision output must be logged per prediction
- Data subjects have a right to human review; the escalation path must be documented in the relevant runbook
- Model fairness and demographic parity metrics must be tracked alongside accuracy
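As one example of a trackable fairness metric, the sketch below computes the demographic parity difference: the gap between the highest and lowest positive-prediction rates across groups. This is a common definition, but the policy does not say which metric or threshold the platform uses, so the function name and inputs are assumptions.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def demographic_parity_difference(predictions: Sequence[int],
                                  groups: Sequence[Hashable]) -> float:
    """Max minus min positive-prediction rate across groups.

    predictions are binary (1 = positive decision); groups holds the
    demographic group label for each prediction. 0.0 means equal rates.
    """
    if len(predictions) != len(groups):
        raise ValueError("predictions and groups must have equal length")
    if not predictions:
        raise ValueError("at least one prediction is required")

    totals = defaultdict(int)
    positives = defaultdict(int)
    for prediction, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(prediction == 1)

    rates = [positives[group] / totals[group] for group in totals]
    return max(rates) - min(rates)
```

For instance, if group `a` receives positive decisions 75% of the time and group `b` 25% of the time, the difference is 0.5.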
8. Incident Accountability
- Every SEV-1 or SEV-2 incident must have a named Incident Commander (IC) assigned within 5 minutes of declaration
- The IC is accountable for: timeline accuracy, stakeholder communication cadence, and postmortem scheduling
- Postmortems are blameless by policy. Findings must target systems, processes, and tooling — not individuals
- All action items must carry: owner, due date, severity classification, and a linked ticket reference
9. Third-Party and LLM Provider Risk
For systems that consume external LLM APIs:
- Provider availability SLA must be tracked against the platform error budget
- Rate limits and cost controls must be configured (see runbooks/llm_cost_spike.md)
- Fallback behaviour (degraded mode, cached responses, circuit breaker) must be
defined, tested, and documented
- Provider outages are classified as SEV-2 or above if they affect user-facing features
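The fallback requirement above names a circuit breaker as one option. The sketch below shows the basic pattern in isolation, assuming a simple consecutive-failure threshold and a fixed cool-down; the class, its parameters, and the injected `clock` are illustrative and do not describe any specific library or the platform's own client.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Route calls to a fallback after repeated provider failures."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                return fallback()  # open: skip the provider entirely
            # Cool-down elapsed: allow one trial call (half-open). The
            # failure count is kept, so a failed trial re-opens at once.
            self._opened_at = None
        try:
            result = primary()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()
            return fallback()
        self._failures = 0  # success closes the circuit
        return result
```

A caller would pass the provider request as `primary` and a degraded-mode or cached response as `fallback`. Injecting `clock` keeps the cool-down behaviour testable without real waiting.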
10. Policy Violations
Violations of this governance policy must be reported to the Platform Governance Board. Repeated or intentional violations may result in:
- Access revocation
- Escalation to the Information Security team
- Formal HR review
Reporting channels:
- Internal: #ml-governance-escalations
- Security incidents: security@[your-domain]
- Anonymous: [Link to whistleblower policy]