# ADR 0028: QA Evaluation Harness and Calibration Gates
## Status
Accepted
## Problem

Pipeline quality checks lacked a dedicated fixture-driven evaluation harness that could catch stage-level regressions on realistic narrative inputs. CI also did not publish a durable QA artifact showing alignment/confidence drift over time.
## Non-goals
- Replacing deterministic rule-based pipeline stages with model-backed systems.
- Introducing external QA services or online benchmarking dependencies.
- Altering API authentication or storage contracts for production story data.
## Public API

New public CLI surface:

- `story-qa-eval`: runs fixture-driven QA evaluation.
- Supports a strict failure mode and a JSON summary output path.

Repository-level quality workflow updates:

- CI executes `story-qa-eval --strict`.
- CI uploads `work/qa/evaluation_summary.json` as a build artifact.
- Pre-push checks execute the strict QA evaluation gate.
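A minimal sketch of how the entrypoint could be wired, assuming an argparse-based Python CLI. Only `--strict` and the `work/qa/evaluation_summary.json` default path come from this ADR; the `--summary-path` flag name and the `run_evaluation()` helper are illustrative placeholders.

```python
# Illustrative sketch of the story-qa-eval entrypoint; not the shipped code.
import argparse
import json
import sys
from pathlib import Path


def run_evaluation() -> dict:
    """Hypothetical harness hook: the real implementation loads fixtures,
    scores per-segment alignment, and checks calibration thresholds."""
    return {"fixture_failures": [], "calibration_failures": []}


def main() -> int:
    parser = argparse.ArgumentParser(prog="story-qa-eval")
    parser.add_argument(
        "--strict",
        action="store_true",
        help="Exit non-zero on any fixture or calibration failure.",
    )
    parser.add_argument(
        "--summary-path",  # assumed flag name; the ADR only fixes the default path
        type=Path,
        default=Path("work/qa/evaluation_summary.json"),
        help="Where to write the JSON evaluation summary.",
    )
    args = parser.parse_args()

    summary = run_evaluation()
    args.summary_path.parent.mkdir(parents=True, exist_ok=True)
    args.summary_path.write_text(json.dumps(summary, indent=2))

    failed = summary["fixture_failures"] or summary["calibration_failures"]
    return 1 if (args.strict and failed) else 0


if __name__ == "__main__":
    sys.exit(main())
```

Writing the summary before deciding the exit code keeps the CI artifact upload useful even on failing runs.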
## Invariants
- QA fixtures are versioned and stored in-repo.
- Harness computes and records per-segment translation alignment scores.
- Strict mode exits non-zero when fixture expectations are not met or calibration thresholds are violated.
- Fixture corpus always includes:
- mixed-language/code-switch translation cases
- at least one adversarial chronology case
- at least one hard-negative calibration case
- Calibration thresholds are evaluated against explicit positive/negative fixture splits and surfaced in summary output (see the sketch after this list).
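A sketch of the calibration gate under these invariants, assuming each fixture carries an explicit `split` label and an already-computed per-segment alignment score in [0, 1]. The field names and the 0.8/0.5 thresholds are assumptions for illustration, not values taken from the harness.

```python
# Sketch of the positive/negative calibration gate; names and thresholds
# are illustrative assumptions, not the project's actual configuration.
from dataclasses import dataclass


@dataclass
class Fixture:
    name: str
    split: str               # "positive" or "negative" (hard-negative case)
    alignment_score: float   # per-segment alignment score, already computed


def calibration_failures(fixtures, pos_min=0.8, neg_max=0.5):
    """Return the fixtures that violate the calibration thresholds.

    Positive fixtures must score at least pos_min; hard-negative fixtures
    must stay below neg_max, so an over-confident scorer also trips the gate.
    """
    failures = []
    for f in fixtures:
        if f.split == "positive" and f.alignment_score < pos_min:
            failures.append((f.name, "positive fixture under threshold"))
        elif f.split == "negative" and f.alignment_score >= neg_max:
            failures.append((f.name, "hard negative over threshold"))
    return failures


# Toy corpus mirroring the required fixture classes from the invariants.
corpus = [
    Fixture("code_switch_es_en", "positive", 0.91),       # mixed-language case
    Fixture("adversarial_chronology", "positive", 0.84),  # adversarial chronology
    Fixture("hard_negative_paraphrase", "negative", 0.32),
]
assert calibration_failures(corpus) == []
```

Checking both directions is the point of the hard-negative split: a scorer that rates everything highly would pass a positive-only gate but fails here.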
## Test plan
- Add unit tests for evaluation harness pass/fail behavior and fixture coverage (see the pytest sketch after this section).
- Add CLI entrypoint tests for QA evaluation command output.
- Extend project contract tests for:
- Makefile QA target
- CI QA harness + artifact upload step
- QA docs presence
- Run full repository gates:
- import/contract checks
- lint/format/type checks
- pytest
- strict canary + strict QA evaluation
- docs build and frontend checks
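A hypothetical pytest sketch of the pass/fail unit tests, reusing the `calibration_failures()` helper sketched above; the `story_qa.eval` module path is assumed, not the project's actual layout.

```python
# Hypothetical unit tests for the harness gate; the module path is assumed.
from story_qa.eval import Fixture, calibration_failures


def test_gate_fails_on_overconfident_hard_negative():
    bad = [Fixture("hard_negative_paraphrase", "negative", 0.95)]
    assert calibration_failures(bad), "hard negative should trip the gate"


def test_gate_passes_on_well_calibrated_corpus():
    good = [
        Fixture("code_switch_es_en", "positive", 0.91),
        Fixture("hard_negative_paraphrase", "negative", 0.30),
    ]
    assert calibration_failures(good) == []
```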