ADR 0028: QA Evaluation Harness and Calibration Gates

Status

Accepted

Problem

Pipeline quality checks lacked a dedicated, fixture-driven evaluation harness that could catch stage-level regressions on realistic narrative inputs. CI also did not publish a durable QA artifact tracking alignment/confidence drift over time.

Non-goals

  • Replacing deterministic rule-based pipeline stages with model-backed systems.
  • Introducing external QA services or online benchmarking dependencies.
  • Altering API authentication or storage contracts for production story data.

Public API

New public CLI surface:

  • story-qa-eval
      • Runs fixture-driven QA evaluation.
      • Supports strict failure mode and a JSON summary output path.
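
A minimal sketch of driving the CLI from Python and reading the summary it writes. Only story-qa-eval --strict and the work/qa/evaluation_summary.json artifact path are confirmed by this ADR; the --summary-json flag name and the summary fields shown are assumptions:

    # Sketch only: invoke the QA evaluation CLI and inspect its JSON summary.
    # --strict is confirmed by this ADR; --summary-json and the summary
    # fields read below are hypothetical.
    import json
    import subprocess
    import sys

    SUMMARY_PATH = "work/qa/evaluation_summary.json"

    result = subprocess.run(
        ["story-qa-eval", "--strict", "--summary-json", SUMMARY_PATH]
    )

    with open(SUMMARY_PATH, encoding="utf-8") as fh:
        summary = json.load(fh)

    print("mean alignment:", summary.get("mean_alignment"))
    print("calibration passed:", summary.get("calibration", {}).get("passed"))

    # Propagate the CLI's strict-mode verdict to the caller.
    sys.exit(result.returncode)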

Repository-level quality workflow updates:

  • CI executes story-qa-eval --strict.
  • CI uploads work/qa/evaluation_summary.json as a build artifact.
  • Pre-push checks execute the strict QA evaluation gate.
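
A minimal pre-push gate sketch, assuming a Python hook script is acceptable; the script location and messages are illustrative, and only the strict CLI invocation comes from this ADR:

    # Hypothetical pre-push gate (e.g. scripts/pre_push_qa.py): run the
    # strict QA evaluation and block the push on any nonzero exit status.
    import subprocess
    import sys

    def main() -> int:
        proc = subprocess.run(["story-qa-eval", "--strict"])
        if proc.returncode != 0:
            print("strict QA evaluation gate failed; push blocked", file=sys.stderr)
        return proc.returncode

    if __name__ == "__main__":
        sys.exit(main())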

Invariants

  • QA fixtures are versioned and stored in-repo.
  • Harness computes and records per-segment translation alignment scores.
  • Strict mode fails the run when fixture expectations are not met or calibration thresholds are violated.
  • Fixture corpus always includes:
      • mixed-language/code-switch translation cases
      • at least one adversarial chronology case
      • at least one hard-negative calibration case
  • Calibration thresholds are evaluated against explicit positive/negative fixture splits and surfaced in summary output.
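
To make the last invariant concrete, a sketch of evaluating a calibration threshold against explicit positive/negative fixture splits; the dataclass, field names, and threshold value are assumptions rather than the harness's actual API:

    # Hypothetical calibration gate: known-positive fixtures must score at or
    # above the threshold, and hard-negative fixtures must stay below it.
    from dataclasses import dataclass

    @dataclass
    class FixtureResult:
        fixture_id: str
        confidence: float
        expected_positive: bool

    def evaluate_calibration(results, threshold=0.8):
        """Return (passed, failing fixture ids) for a positive/negative split."""
        failures = [
            r.fixture_id
            for r in results
            if (r.confidence >= threshold) != r.expected_positive
        ]
        return not failures, failures

    results = [
        FixtureResult("code_switch_01", 0.91, True),
        FixtureResult("adversarial_chronology_01", 0.85, True),
        FixtureResult("hard_negative_01", 0.42, False),
    ]
    passed, failures = evaluate_calibration(results)
    print({"calibration_passed": passed, "failures": failures})

In strict mode the run would fail whenever the failure list is non-empty, with the verdict surfaced in the summary output.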

Test plan

  • Add unit tests for evaluation harness pass/fail behavior and fixture coverage (a pytest sketch follows this list).
  • Add CLI entrypoint tests for QA evaluation command output.
  • Extend project contract tests for:
      • Makefile QA target
      • CI QA harness + artifact upload step
      • QA docs presence
  • Run full repository gates:
      • import/contract checks
      • lint/format/type checks
      • pytest
      • strict canary + strict QA evaluation
      • docs build and frontend checks
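
As a concrete instance of the harness unit tests above, a minimal pytest sketch; the story_qa.harness module, the run_evaluation() signature, and QAEvaluationError are hypothetical stand-ins for the real harness API:

    # Sketch: strict runs should succeed on clean fixtures and fail loudly
    # when a calibration expectation is violated. All names are illustrative.
    import pytest

    from story_qa.harness import QAEvaluationError, run_evaluation

    def test_passing_fixtures_produce_summary(tmp_path):
        summary_path = tmp_path / "evaluation_summary.json"
        run_evaluation(
            fixtures_dir="tests/fixtures/qa/passing",
            strict=True,
            summary_path=summary_path,
        )
        assert summary_path.exists()

    def test_strict_mode_fails_on_calibration_violation(tmp_path):
        # A hard-negative fixture scored above threshold must fail the run.
        with pytest.raises(QAEvaluationError):
            run_evaluation(
                fixtures_dir="tests/fixtures/qa/hard_negative_violation",
                strict=True,
                summary_path=tmp_path / "evaluation_summary.json",
            )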