ADR 0013: Bounded Observability and Anomaly Retention¶
Status¶
Accepted
Problem¶
We need richer operational visibility (logging + anomaly breadcrumbs) without creating unbounded log growth or noisy persistence. Earlier behavior logged to stdout only and had no durable anomaly channel for analysis/dashboard failures.
Non-goals¶
- Introducing external logging infrastructure (Loki/ELK/Cloud logging) in this ADR.
- Adding distributed tracing.
- Storing full request/response payloads in durable logs.
Decision¶
Add a bounded observability layer with two paths:
- Rotating runtime logs configured from
story_gen.cli.api - SQLite anomaly event persistence for warning/error breadcrumbs
Implemented components:
src/story_gen/adapters/observability.py- one-time process logging setup
- rotating file handler + console handler
src/story_gen/adapters/sqlite_anomaly_store.py- append anomaly records
- list recent anomalies
- prune by retention days and max row cap
src/story_gen/api/app.py- startup pruning (
STORY_GEN_ANOMALY_RETENTION_DAYS,STORY_GEN_ANOMALY_MAX_ROWS) - anomaly writes for quality-gate failures and malformed dashboard payloads
Public API¶
No new HTTP endpoints are introduced in this ADR.
New runtime environment variables:
STORY_GEN_LOG_LEVELSTORY_GEN_LOG_PATHSTORY_GEN_LOG_MAX_BYTESSTORY_GEN_LOG_BACKUP_COUNTSTORY_GEN_ACCESS_LOG_LEVELSTORY_GEN_ANOMALY_RETENTION_DAYSSTORY_GEN_ANOMALY_MAX_ROWS
Invariants¶
- Runtime file logs must remain bounded by rotation settings.
- Anomaly rows must remain bounded by retention + cap pruning.
- Anomalies persist only concise metadata, never full raw content payloads.
- API behavior remains owner-scoped and does not expose anomaly internals publicly.
Test plan¶
- Unit tests for SQLite anomaly store write/list/prune behavior.
- API tests for:
- startup pruning invocation
- quality-gate failure anomaly persistence
- malformed dashboard payload anomaly persistence
- Full repository quality gate:
uv run pre-commit run --all-filesuv run pytestuv run mypyuv run mkdocs build --strict