# QA Evaluation Harness
The repository ships a fixture-driven quality harness for story pipeline QA.
Run it locally:
```bash
uv run story-qa-eval --strict --output work/qa/evaluation_summary.json
```

`--strict` makes the run exit non-zero when any regression gate fails.

## Purpose

- Guard extraction, beat, theme/arc, timeline, and insight behavior against regressions.
- Compute and persist per-segment translation alignment scores.
- Calibrate theme and arc confidence thresholds against held-out fixtures.
- Produce one machine-readable evaluation summary artifact for CI.

## Fixture Source of Truth

- File: `tests/fixtures/story_pipeline_eval_fixtures.v1.json`
- Version key: `fixture_version: story_pipeline_eval.v1`

The fixture corpus includes:
- Stage-coverage gold case for setup/escalation/climax/resolution beat mapping.
- Mixed-language and code-switching translation cases.
- Adversarial chronology/conflicting-entity case.
- Hard-negative theme calibration case.

## Fixture Format

Top-level fields (a skeleton sketch follows the list):

- `fixture_version`: stable fixture schema/version key.
- `cases`: list of fixture cases.
- `calibration`: held-out split tags and calibration thresholds.
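For orientation, a minimal skeleton of the file, written as a Python literal so the placeholders can be annotated; only the three top-level key names are taken from the schema above, everything else is hypothetical:

```python
# Hypothetical skeleton of the fixture file, as a Python literal; the three
# top-level keys come from the documented schema, the values are placeholders.
fixture_skeleton = {
    "fixture_version": "story_pipeline_eval.v1",
    "cases": [],        # fixture cases (fields listed under "Each case includes")
    "calibration": {},  # held-out split tags and thresholds (see "Calibration Method")
}
```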
Each case includes:
- `case_id`: stable identifier.
- `description`: human-readable intent.
- `source_type`: `text | document | transcript`.
- `source_text`: fallback raw source input.
- `segments`: optional explicit segments for deterministic stage regression tests.
- `target_language`: translation target.
- `tags`: classification tags (for example `calibration-positive`, `hard-negative`).
- `expectations`: threshold/assertion map.
Supported expectations keys (a hypothetical case exercising several of them follows this list):

- Minimum checks: `min_alignment_mean`, `min_alignment_min`, `min_event_count`, `min_beat_count`, `min_insight_count`, `min_translation_quality`, `min_timeline_consistency`, `min_non_story_theme_confidence`, `min_arc_confidence`
- Maximum checks: `max_hallucination_risk`, `max_timeline_conflicts`, `max_timeline_consistency`, `max_non_story_theme_strength`
- Set/sequence checks: `expected_beat_stage_sequence`, `required_beat_stages`, `required_theme_labels`, `forbidden_theme_labels`, `required_timeline_conflict_codes`, `required_insight_granularities`
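Under the same caveat, a sketch of one complete case; the keys follow the two lists above, while the identifier, text, and threshold values are invented for illustration:

```python
# Hypothetical fixture case; keys follow the documented schema, values are invented.
example_case = {
    "case_id": "stage-coverage-gold-001",
    "description": "Gold case exercising all four beat stages.",
    "source_type": "text",  # one of: text | document | transcript
    "source_text": "Mira finds the map. Rivals close in. ...",
    "target_language": "en",
    "tags": ["calibration-positive"],
    "expectations": {
        # minimum checks
        "min_beat_count": 4,
        "min_alignment_mean": 0.8,
        # maximum checks
        "max_hallucination_risk": 0.2,
        # set/sequence checks
        "expected_beat_stage_sequence": ["setup", "escalation", "climax", "resolution"],
    },
}
```

Omitting `segments`, as here, leaves segmentation to the pipeline; supply it when a stage regression test must be deterministic.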

## Calibration Method

Calibration uses case tags:

- Positive split: tags listed in `calibration.positive_tags`.
- Negative split: tags listed in `calibration.negative_tags`.
Observed metrics:
- Theme confidence floor: minimum non-story theme confidence across positive split.
- Arc confidence floor: minimum arc confidence across positive split.
- Non-story strength ceiling: maximum non-story theme strength across negative split.
Gates compare observed metrics against the configured thresholds (a sketch of this block follows the list):

- `calibration.thresholds.theme_confidence_floor`
- `calibration.thresholds.arc_confidence_floor`
- `calibration.thresholds.non_story_strength_ceiling`
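A sketch of the calibration block under those key paths; the tag names mirror the tags mentioned earlier, and the numeric thresholds are placeholders:

```python
# Hypothetical calibration block; key paths match the documented gates,
# tag names and numeric thresholds are placeholders.
example_calibration = {
    "positive_tags": ["calibration-positive"],
    "negative_tags": ["hard-negative"],
    "thresholds": {
        "theme_confidence_floor": 0.60,      # positive-split minimum must reach this
        "arc_confidence_floor": 0.55,        # positive-split minimum must reach this
        "non_story_strength_ceiling": 0.30,  # negative-split maximum must stay below this
    },
}
```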

## CI Artifact

CI runs the harness in strict mode and uploads `work/qa/evaluation_summary.json`.
The summary includes (a plausible shape is sketched after this list):
- Per-case pass/fail and failure reasons.
- Per-segment alignment scores (segment id + quality score + method).
- Confidence distributions to track threshold drift over time.
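The summary schema is owned by the harness; as a rough sketch only, every field name below is a hypothetical stand-in for the three kinds of content just listed:

```python
# Rough, hypothetical shape for work/qa/evaluation_summary.json; the harness
# defines the real schema, so every field name here is an invented stand-in.
example_summary = {
    "cases": [
        {"case_id": "stage-coverage-gold-001", "passed": True, "failure_reasons": []},
    ],
    "segment_alignment": [
        {"segment_id": "seg-001", "quality_score": 0.91, "method": "embedding-cosine"},
    ],
    "confidence_distributions": {
        "theme_confidence": [0.62, 0.71, 0.84],
        "arc_confidence": [0.58, 0.66],
    },
}
```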

## Fixture Update Process

- Update fixture cases in `tests/fixtures/story_pipeline_eval_fixtures.v1.json`.
- Keep at least one adversarial and one hard-negative case.
- Run:

  ```bash
  uv run story-qa-eval --strict --output work/qa/evaluation_summary.json
  uv run pytest tests/test_pipeline_evaluation_harness.py tests/test_project_contracts.py
  ```

- If thresholds change, document the rationale in PR notes and keep calibration tags unchanged unless the split design changes.