Remediation Log - ML Incident Response API
Single source of truth for all remediation and roadmap work on the ML Incident Response Playbook. Status updated after every remediation cycle. Never remove completed items; mark VALIDATED.
Status Legend
| Status | Meaning |
|---|---|
| BACKLOG | Identified, not yet started |
| IN PROGRESS | Actively being worked |
| BLOCKED | Waiting on dependency |
| FIXED | Code/config change committed to main |
| VALIDATED | Fix verified by test / CI pass |
| REVERTED | Committed then reverted; back to BACKLOG with note |
| DEFERRED | Intentionally postponed with documented rationale |
Priority Tiers
| Tier | Criteria |
|---|---|
| CRITICAL | Security vulnerability, broken runtime, CI hard failure |
| HIGH | Coverage regression, CI gate mismatch, missing safety net |
| MEDIUM | Architectural debt, observability gap, DX friction |
| LOW | Polish, documentation, portfolio signal |
Active Tracker
🔴 CRITICAL / HIGH - Fix First
| ID | Phase | Category | Issue | Sev | Status | Blocking Deps | Files Affected | Validation |
|---|---|---|---|---|---|---|---|---|
| R-P1 | Cycle 2 | CI/CD | Integration coverage gate 53% (CI-66b); recovery to ≥65% deferred as CI-67 | HIGH | BACKLOG | CI-67 fixture work | secured_ci.yml | Gate restored ≥65%; CI green |
| R-P12 | Cycle 2 | Testing | CI-67 open: Redis / lifespan / auth paths unreachable in integration fixtures | HIGH | BACKLOG | CI-67 | tests/integration/ | Integration coverage ≥65%; CI-67 closed |
| R-P23 | Phase 12 | Architecture | Refactor src/incident_tracker.py → thin facade over domain/services/repositories | HIGH | BACKLOG | R-P22 | src/domain/, src/services/, src/repositories/ | All existing tests pass; mypy clean |
| R-P57 | Phase 18 | Code Hygiene | Strip all inline micro-changelogs from source files; history belongs in git and CHANGELOG.md only | HIGH | BACKLOG | - | All src/, api/, tests/, config files | grep -r "# .*20[0-9][0-9]-" src/ api/ returns zero hits; CI green |
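
The R-P57 validation check could also be run from Python, for example inside a test. The following is a minimal sketch that reuses the regular expression from the grep command in the table; the function names `find_dated_comments` and `scan_tree` are hypothetical, and limiting the scan to `*.py` files is an assumption (`grep -r` inspects every file).

```python
import re
from pathlib import Path

# Same pattern as the grep check in R-P57: a comment that contains a 20xx- date.
DATED_COMMENT = re.compile(r"# .*20[0-9][0-9]-")


def find_dated_comments(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that match the R-P57 pattern."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(text.splitlines(), start=1)
        if DATED_COMMENT.search(line)
    ]


def scan_tree(roots: list[str]) -> dict[str, list[tuple[int, str]]]:
    """Scan *.py files under each root; missing directories are skipped."""
    hits: dict[str, list[tuple[int, str]]] = {}
    for root in roots:
        root_path = Path(root)
        if not root_path.is_dir():
            continue
        for path in sorted(root_path.rglob("*.py")):
            found = find_dated_comments(
                path.read_text(encoding="utf-8", errors="replace")
            )
            if found:
                hits[str(path)] = found
    return hits
```

An empty result from `scan_tree(["src", "api"])` would correspond to the "zero hits" condition in the Validation column.
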
🟠 MEDIUM - Do Next
| ID | Phase | Category | Issue | Sev | Status | Blocking Deps | Files Affected | Validation |
|---|---|---|---|---|---|---|---|---|
| R-P19 | Phase 13 | Architecture | src/incident_tracker.py module-level _engine singleton (Phase 13 DI migration) | MEDIUM | DEFERRED | R-P23 | src/incident_tracker.py | Engine constructed only inside lifespan context |
| R-P24 | Phase 12 | Architecture | Collapse deps to single source of truth: pyproject.toml + pip-compile | MEDIUM | BACKLOG | R-P23 | pyproject.toml, requirements.txt | pip-compile round-trips cleanly; lockfile-check CI green |
| R-P25 | Phase 12 | Architecture | Add minimum credible content to infrastructure/: Terraform stub + README | MEDIUM | BACKLOG | - | infrastructure/main.tf, infrastructure/README.md | Files present; terraform validate passes |
| R-P26 | Phase 12 | Architecture | Add minimum credible content to dbt/: README + one model stub | MEDIUM | BACKLOG | - | dbt/README.md, dbt/models/incidents.sql | Files present; renders in docs |
| R-P27 | Phase 12 | Architecture | Add minimum credible content to orchestration/: README explaining DAG pattern | MEDIUM | BACKLOG | - | orchestration/README.md | File present; explains Prefect/Airflow integration |
| R-P28 | Phase 12 | Architecture | Update Architecture Mermaid diagram in README to reflect real code path | MEDIUM | BACKLOG | R-P23 | README.md | Diagram matches: FastAPI → Auth → Services → Domain → Postgres/Redis |
| R-P33 | Phase 13 | CI/CD | Fix README CI/CD section coverage discrepancy (says ≥68%, gate is 75%) | MEDIUM | BACKLOG | - | README.md | README states correct gate value |
| R-P34 | Phase 14 | Runbooks | Expand all runbooks to operational template format | MEDIUM | BACKLOG | - | runbooks/*.md | All runbooks have metadata table, query examples, decision tree, escalation |
| R-P35 | Phase 14 | Runbooks | Add runbooks/model_rollback.md | MEDIUM | BACKLOG | R-P34 | runbooks/model_rollback.md | File present; meets operational template standard |
| R-P36 | Phase 14 | Runbooks | Add runbooks/feature_store_corruption.md | MEDIUM | BACKLOG | R-P34 | runbooks/feature_store_corruption.md | File present; meets operational template standard |
| R-P37 | Phase 14 | Runbooks | Add runbooks/runbook_test_log.md: game-day exercise evidence | MEDIUM | BACKLOG | R-P34 | runbooks/runbook_test_log.md | File present; at least one exercise entry documented |
| R-P38 | Phase 14 | Observability | Add configs/slos.yml: numeric SLO definitions | MEDIUM | BACKLOG | - | configs/slos.yml | File present; values match Grafana dashboard thresholds |
| R-P39 | Phase 15 | Observability | Add api/metrics.py: Prometheus endpoint with Counter, Histogram, Gauge | MEDIUM | BACKLOG | - | api/metrics.py | /metrics endpoint responds; curl shows metric names |
| R-P40 | Phase 15 | Observability | Register metrics router in api/main.py | MEDIUM | BACKLOG | R-P39 | api/main.py | GET /metrics returns 200 with Prometheus text format |
| R-P41 | Phase 15 | Observability | Instrument create_incident path with metric labels | MEDIUM | BACKLOG | R-P39, R-P40 | api/routers/incidents.py | Metrics visible in Prometheus scrape after POST |
| R-P42 | Phase 15 | Observability | Bind runbook threshold values to real Prometheus query expressions in configs/slos.yml | MEDIUM | BACKLOG | R-P38, R-P39 | configs/slos.yml | SLO file references real metric names from api/metrics.py |
| R-P43 | Phase 15 | Observability | Update Grafana dashboard JSON to use real metric names | MEDIUM | BACKLOG | R-P39, R-P42 | dashboards/ml_operations_overview.json | Dashboard panels show live data in local Compose stack |
| R-P44 | Phase 16 | MLOps | Confirm HF Space org/slug: README shows zrlo/ml-incident-api, verify correct | MEDIUM | BACKLOG | - | README.md, deploy-hf.yml | HF Space URL resolves; workflow targets correct slug |
| R-P45 | Phase 16 | MLOps | Verify Dockerfile port (HF app_port: 8080); confirm FastAPI binds 0.0.0.0:8080 | MEDIUM | BACKLOG | - | Dockerfile | Container starts and responds on 8080 |
| R-P46 | Phase 16 | MLOps | Provision external Postgres (Neon free tier) → connect string → HF Secret | MEDIUM | BACKLOG | - | HF Space Secrets | GET /ready returns 200 on live HF Space |
| R-P47 | Phase 16 | MLOps | Provision external Redis (Upstash free tier) → connect string → HF Secret | MEDIUM | BACKLOG | - | HF Space Secrets | Rate limiting functional on live HF Space |
| R-P48 | Phase 16 | MLOps | Add scripts/seed_demo_user.py: seeds read-only demo user on first boot | MEDIUM | BACKLOG | - | scripts/seed_demo_user.py | Script idempotent; demo user exists after run |
| R-P49 | Phase 16 | MLOps | Add make deploy-hf + make hf-status Makefile targets | MEDIUM | BACKLOG | R-P44, R-P45 | Makefile | make deploy-hf pushes to HF Space remote |
| R-P50 | Phase 16 | MLOps | Smoke-test live HF endpoint: GET /health, POST /auth/token, GET /incidents | MEDIUM | BACKLOG | R-P46, R-P47, R-P48 | - | All three requests return expected responses on live Space |
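
R-P48 requires the seed script to be idempotent. One way to get that property is to let a uniqueness constraint absorb repeat runs. The sketch below uses an in-memory SQLite database as a stand-in for the project's Postgres; the `users` table layout, the `demo` username, and the `read_only` role value are all hypothetical.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical users table; the real schema is not shown in this log."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "username TEXT PRIMARY KEY, role TEXT NOT NULL)"
    )
    conn.commit()


def seed_demo_user(
    conn: sqlite3.Connection, username: str = "demo", role: str = "read_only"
) -> bool:
    """Insert the demo user if absent. Returns True only when a row was created."""
    # INSERT OR IGNORE turns a primary-key conflict into a no-op,
    # so running the script twice leaves exactly one row.
    cursor = conn.execute(
        "INSERT OR IGNORE INTO users (username, role) VALUES (?, ?)",
        (username, role),
    )
    conn.commit()
    return cursor.rowcount == 1
```

On Postgres the equivalent statement would typically be `INSERT ... ON CONFLICT DO NOTHING`; which form scripts/seed_demo_user.py ends up using is not decided in this log.
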
🟡 LOW - Polish & Portfolio
| ID | Phase | Category | Issue | Sev | Status | Blocking Deps | Files Affected | Validation |
|---|---|---|---|---|---|---|---|---|
| R-P51 | Phase 17 | Portfolio | Add "What Senior Reviewers Will Find" section to README | LOW | BACKLOG | R-P28, R-P33 | README.md | Section present; copy is accurate and current |
| R-P52 | Phase 17 | Portfolio | Add "Known Limitations" callout section to README | LOW | BACKLOG | - | README.md | Section present; no overselling |
| R-P53 | Phase 17 | Portfolio | Update Roadmap section to strategic format: Q3 2026 / Q4 2026 / Aspirational | LOW | BACKLOG | R-P50 | README.md | Roadmap reflects actual planned phases |
| R-P54 | Phase 17 | Portfolio | Verify all badge URLs resolve and are accurate | LOW | BACKLOG | R-P20 | README.md | All badges return 200; values match CI state |
| R-P55 | Phase 17 | Portfolio | Verify all internal doc links are not broken | LOW | BACKLOG | R-P34–R-P37 | README.md, docs/ | mkdocs build --strict passes with 0 broken links |
| R-P56 | Phase 17 | Portfolio | Final README read-through: remove stale Fly.io references; verify zrl.dev link | LOW | BACKLOG | R-P51–R-P55 | README.md | Zero Fly.io references; zrl.dev links resolve |
Completed Archive
Cycle 3 (2026-05-29)
| ID | Issue | Commit | Resolution |
|---|---|---|---|
| R-P22 | Characterization tests for src/incident_tracker.py | 1db8d11 | tests/unit/test_incident_tracker_char.py added; full CRUD, keyset pagination, state machine, init_db, get_session coverage |
| R-P4 | Semgrep fails for forks/Dependabot with empty token | 1db8d11 | if: env.SEMGREP_APP_TOKEN != '' condition added; hard gate retained for owned branches |
| R-P7 | CONTRIBUTING.md absent | 1db8d11 | Full onboarding guide added |
| R-P13 | docker-compose.yml missing/unverified | 1db8d11 | Verified and hardened: postgres:16-alpine + redis:7-alpine with health-checks |
| R-P20 | README badges absent | 1db8d11 | Python version and security badges added; CI/codecov/Codacy confirmed present |
| R-P16 | /healthz and /readyz K8s probe rename | 1db8d11 → REVERTED 098fe0b | Reverted 2026-05-29: no Kubernetes in project; broke 5 unit tests expecting /health and /ready. Routes restored to originals. |
Cycle 2 (2026-05-29)
| ID | Issue | Commit | Resolution |
|---|---|---|---|
| R-P11 | SlowAPI get_remote_address stored raw IP as rate-limiter Redis key; HIGH-01 final PII vector | 660005862 | Replaced with _rate_limit_key(): SHA-256(best-available-identifier)[:16]; raw IPs no longer enter limiter state |
| R-P21 | No regression test for HIGH-01 privacy protections across middleware + rate limiter | 660005862 | Added tests/unit/test_middleware_pii.py (5 tests) |
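
The R-P11 resolution describes the limiter key as SHA-256(best-available-identifier)[:16]. A minimal sketch of that idea follows. The function names, the `user:` / `ip:` prefixes, the identifier precedence (user ID, then forwarded address, then client host), and the choice of a hex digest are assumptions for illustration; the project's actual `_rate_limit_key()` is not reproduced in this log.

```python
import hashlib
from typing import Optional


def best_available_identifier(
    user_id: Optional[str],
    forwarded_for: Optional[str],
    client_host: Optional[str],
) -> str:
    """Pick the most specific identifier available (hypothetical precedence)."""
    if user_id:
        return f"user:{user_id}"
    if forwarded_for:
        # X-Forwarded-For may hold a comma-separated chain; take the first hop.
        return "ip:" + forwarded_for.split(",")[0].strip()
    return "ip:" + (client_host or "unknown")


def rate_limit_key(identifier: str) -> str:
    """Return the first 16 hex characters of the SHA-256 digest of the identifier."""
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    return digest[:16]
```

With this shape, the same caller always maps to the same key, so limits still apply per caller, while the value stored in Redis is a truncated digest and not the address itself.
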
Cycle 1 (2026-05-29)
| ID | Issue | Commit | Resolution |
|---|---|---|---|
| R-P2 | Duplicate lint: target in Makefile; mypy silently skipped on pipelines/ | 0890c89 | Single deduplicated target; pipelines/ added to mypy + ruff scope |
| R-P3 | test-int gate 65% local vs 53% CI | 0890c89 | Aligned to 53%; CI-67 recovery path documented |
| R-P5 | Pre-commit mypy --ignore-missing-imports diverges from CI strict config | b028d6c | Removed flag; added stub deps to additional_dependencies |
| R-P6 | MASTER_ACTION_TRACKER.md reference in CHANGELOG | 171759d | File merged into this log (2026-05-29); local copy deleted |
| R-P8 | CODEOWNERS minimal | af04cc5 | Hardened with explicit security-sensitive and dependency manifest paths |
| R-P9 | RequestTimeoutMiddleware logs raw client IP | c7ea849 | _pseudo_ip() applied |
| R-P10 | MaxBodySizeMiddleware logs raw client IP in 2 branches | c7ea849 | _pseudo_ip() applied to both branches |
| R-P29 | pytest-xdist in requirements-dev.txt | pre-existing | Confirmed already present; -n auto wired in CI |
| R-P30 | test_model_registry_thread_safety.py | pre-existing | Confirmed already present in CI test list |
| R-P31 | Redis denylist concurrency tests | pre-existing | Confirmed test_redis_denylist_concurrency.py already present |
| R-P32 | test_incident_service_contract.py | pre-existing | Confirmed already present in CI test list |
Pre-Engagement Archive
Phase 6 - Repo Hygiene + CI Accuracy (2026-05-27)
| ID | Finding | Sev | Status | Files Changed |
|---|---|---|---|---|
| R-02 | Orphaned .github/release-placeholder-v110.txt removed | LOW | CLOSED | .github/release-placeholder-v110.txt |
| R-06 | Fabricated Docker digest claim removed; honest TODO added pending network verification | HIGH | CLOSED | Dockerfile |
| R-08 | secured_ci.yml SHA reference block stale for setup-python | HIGH | CLOSED | .github/workflows/secured_ci.yml |
| R-09 | secured_ci.yml SHA reference block stale for upload-artifact | HIGH | CLOSED | .github/workflows/secured_ci.yml |
| R-10 | Workflow permissions audit completed for secured_ci.yml, mermaid-render.yml, stale.yml, codeql.yml | MED | PARTIAL | .github/workflows/*.yml |
| R-25 | mermaid-render.yml SHA reference block stale for setup-node; loop safety validated | LOW | CLOSED | .github/workflows/mermaid-render.yml |
| CI-51 | stale.yml floating tag actions/stale@v9 SHA-pinned to verified commit | MED | CLOSED | .github/workflows/stale.yml |
| CI-52 | docs.yml SHA reference block stale for setup-python; synced to live pin | LOW | CLOSED | .github/workflows/docs.yml |
Phase 3 - Architecture (Complete)
| ID | Finding | Sev | Status | Files Changed |
|---|---|---|---|---|
| ARCH-01 | HS256 symmetric JWT → upgrade to RS256 + JWKS rotation | HIGH | CLOSED | src/auth/jwt_rs256.py; JWKS endpoint /.well-known/jwks.json |
| ARCH-02 | passlib → argon2-cffi password hashing (OWASP 2024) | HIGH | CLOSED | src/auth/password.py, requirements.txt |
| ARCH-03 | _USERS dict → PostgresUserRepository database-backed | HIGH | CLOSED | src/users/repository.py, api/app.py, alembic/versions/0001_initial_schema.py |
| ARCH-04 | Secrets via Vault / AWS Secrets Manager (zero-secret images) | HIGH | CLOSED | docs/policies/secrets_management.md |
| ARCH-05 | GDPR /users/me/export and /users/me DELETE endpoints | MED | CLOSED | api/gdpr_routes.py |
| ARCH-06 | Argon2 rehash-on-login migration (zero-downtime) | MED | CLOSED | src/auth/password.py, src/users/repository.py |
| ARCH-07 | Rate limiting: per-user sliding window via Redis | MED | CLOSED | api/rate_limit.py |
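
ARCH-07 refers to a per-user sliding window. The sketch below illustrates the sliding-window technique with an in-memory deque per user and an injected clock value; it is a stand-in for explanation only. The class name `SlidingWindowLimiter` is hypothetical, and the Redis-backed implementation in api/rate_limit.py is not shown in this log.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user within any `window_seconds` span."""

    def __init__(self, limit: int, window_seconds: float) -> None:
        self.limit = limit
        self.window = window_seconds
        # One timestamp queue per user; the oldest entry sits on the left.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str, now: float) -> bool:
        hits = self._hits[user_id]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False  # Rejected requests are not recorded.
        hits.append(now)
        return True
```

Passing `now` explicitly keeps the example deterministic in tests; a production caller would usually supply `time.monotonic()` or rely on server-side time in the backing store.
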
Phase 2 - Infrastructure, Config, CI/CD (Complete)
| ID | Finding | Sev | Status | Files Changed |
|---|---|---|---|---|
| CI-06 | pip-audit had \|\| true bypass; CVE findings silently ignored | HIGH | CLOSED | ci_cd/secure-ci.yml |
| CI-07 | No semgrep SAST; OWASP Top 10 pattern coverage gap | MED | CLOSED | ci_cd/secure-ci.yml |
| CI-08 | GitHub Actions not pinned to SHA digest; tag-mutation risk | MED | CLOSED | ci_cd/secure-ci.yml |
| CI-09 | Coverage threshold 70%; insufficient for security-critical auth paths | MED | CLOSED | ci_cd/secure-ci.yml (raised to 75%) |
| CI-11 | No dependency-review action on PRs | MED | CLOSED | ci_cd/secure-ci.yml |
| CI-12 | Container scan built with ENVIRONMENT=test (misses production-mode issues) | LOW | CLOSED | ci_cd/secure-ci.yml |
| CI-13 | REDIS_PASSWORD not in CI test environment | MED | CLOSED | ci_cd/secure-ci.yml |
| CI-14 | GitHub Actions workflow-level write-all permissions | MED | CLOSED | ci_cd/secure-ci.yml (per-job least-privilege) |
| CFG-01 | No centralized Settings class; os.environ scattered across modules | MED | CLOSED | src/config.py (pydantic-settings + startup validation) |
| CFG-02 | .gitignore had unresolved merge conflict markers | HIGH | CLOSED | .gitignore |
| CFG-03 | .env.example missing REDIS_PASSWORD and DEV_*_PASSWORD | MED | CLOSED | .env.example |
Phase 1 - Critical Security Hardening (Complete)
| ID | Finding | Sev | Status | Files Changed |
|---|---|---|---|---|
| P1-01 | noreply@ security contact replaced with GitHub Private Vulnerability Reporting | HIGH | CLOSED | SECURITY.md |
| P1-02 | SECURITY.md controls table corrected to match actual repository state | MED | CLOSED | SECURITY.md |
| P1-03 | Airflow scope item removed: template copy-paste artifact, project has no Airflow | LOW | CLOSED | SECURITY.md |
| P1-04 | All GitHub Actions workflows pinned to SHA digests (supply chain hardening) | HIGH | CLOSED | .github/workflows/ci.yml |
| P1-05 | Trivy gate restored to blocking exit-code: '1' (was bypassed as CI-26) | HIGH | CLOSED | .github/workflows/ci.yml |
| P1-06 | pip-audit JSON artifact generation decoupled from hard gate (removes silent failure) | MED | CLOSED | .github/workflows/ci.yml |
| P1-07 | CI secret availability guard added to integration-tests job | MED | CLOSED | .github/workflows/ci.yml |
| P1-08 | CodeQL semantic SAST workflow added (security-and-quality query suite) | HIGH | CLOSED | .github/workflows/codeql.yml |
| CI-10 | Branch protection ruleset enforced on main | MED | CLOSED | GitHub repo Settings → Rulesets |
Phase 0 - Critical/High/Medium (Complete)
| ID | Finding | Sev | Status | Files Changed |
|---|---|---|---|---|
| CRIT-A | Hard-coded stub user passwords "admin-dev-only" silently used as fallback | CRIT | CLOSED | api/app.py |
| CRIT-B | Async/sync denylist boundary: asyncio.get_event_loop().run_until_complete() inside async context raised RuntimeError on every revocation | CRIT | CLOSED | src/redis_denylist.py, api/app.py ×3 |
| CRIT-C | Airflow + dbt dependencies in API image: +800 MB, arbitrary code exec surface | CRIT | CLOSED | requirements.txt, new requirements-airflow.txt |
| HIGH-A | Redis: no AUTH, all-interface bind (0.0.0.0), no password | HIGH | CLOSED | docker-compose.yml, .env.example |
| HIGH-B | python-jose CVE-2024-33663 (JWT algorithm confusion) | HIGH | CLOSED | requirements.txt (PyJWT), api/app.py |
| HIGH-C | No request body size limit → OOM DoS; no request timeout → slow-loris | HIGH | CLOSED | api/middleware.py, api/app.py |
| HIGH-D | lru_cache on get_settings() causes env var bleed between tests | HIGH | CLOSED | tests/conftest.py, src/config.py |
| MED-A | Dockerfile base image on floating tag (tag-mutation supply chain attack) | MED | CLOSED | Dockerfile (SHA-256 digest pinned on both stages) |
| MED-B | Full repo .:/app bind mount; .dockerignore missing | MED | CLOSED | docker-compose.yml, .dockerignore |
| MED-C | SlowAPI rate limiting missing on /auth/token | MED | CLOSED | api/app.py |
| MED-D | passlib unmaintained; bcrypt unpinned | MED | CLOSED | requirements.txt |
| MED-E | Missing OWASP security headers (CSP, HSTS, X-Frame-Options, etc.) | MED | CLOSED | api/middleware.py |
| LOW-A | .DS_Store tracked in git; merge conflicts in .gitignore | LOW | CLOSED | .gitignore |
| LOW-B | Telemetry ports bound to 0.0.0.0 (Jaeger UI, OTel receiver, Prometheus) | LOW | CLOSED | docker-compose.yml (loopback-only: 127.0.0.1:*) |
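
HIGH-D is easy to reproduce in isolation. The sketch below shows how an `lru_cache`-wrapped settings loader keeps returning values read from an earlier environment, and how clearing the cache restores fresh reads. The dict-based `get_settings()` and the `reset_settings_cache()` helper are simplified, hypothetical stand-ins; the log does not spell out the exact fix applied in tests/conftest.py and src/config.py.

```python
import os
from functools import lru_cache


@lru_cache(maxsize=1)
def get_settings() -> dict:
    """Read settings from the environment once, then serve the cached result."""
    return {"environment": os.environ.get("ENVIRONMENT", "development")}


def reset_settings_cache() -> None:
    """Drop the cached settings so the next call re-reads os.environ."""
    get_settings.cache_clear()


def demonstrate_bleed() -> tuple:
    """Return (stale_value, fresh_value) observed after the environment changes."""
    reset_settings_cache()
    os.environ["ENVIRONMENT"] = "test"
    get_settings()  # Populates the cache with "test".
    os.environ["ENVIRONMENT"] = "production"
    stale = get_settings()["environment"]  # Still served from the cache.
    reset_settings_cache()
    fresh = get_settings()["environment"]  # Re-read from os.environ.
    return stale, fresh
```

In a pytest suite, one common arrangement is an autouse fixture that calls the reset helper before and after each test, so no test observes another test's environment.
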
Pre-Engagement Milestones
| ID | Phase | Issue | Resolution | Date |
|---|---|---|---|---|
| - | Phase 11 | Supply chain: lockfile-check CI job; pip-audit pre-commit; make deps-compile (CI-55) | Complete | 2026-05-28 |
| - | Phase 8 | Portfolio presentation: "Quick Proof of Quality" table in README | Partial complete | 2026-05-27 |
| - | Phase 7 | HF Spaces scaffolding: README YAML frontmatter; deploy-hf.yml workflow | Partial complete | 2026-05-27 |
| - | Phase 4 | Grafana + Prometheus infra: docker-compose.yml; dashboards/ml_operations_overview.json | Complete | 2026-05-27 |
| - | Phase 1 | Security hardening: Semgrep hard gate, CI_POSTGRES_PASSWORD secret ref, Bandit gate, per-job permissions: | Complete (CI-52) | 2026-05-26 |
| R-GOD | - | God-file api/app.py (1005 lines) extracted into routers, middleware, auth, config modules | Complete | Pre-engagement |
CI-10 Ruleset Detail - main branch (enforced 2026-05-24)
| Rule | Setting |
|---|---|
| Restrict deletions | Enabled |
| Require signed commits | Enabled |
| Require a PR before merging | Enabled |
| Required status checks | secrets-scan, dependency-audit, SAST - Bandit + mypy, test, 🧪 Tests (Python 3.11) |
| Require branches to be up to date | Enabled |
| Block force pushes | Enabled |