- Repo with prompts stored as versioned files.
- Test corpus with inputs and expected outputs or acceptance checks.
- Access to a model provider, with a cost cap for CI runs.
- Add a prompt test runner script that reads fixtures and evaluates acceptance checks.
- Mock external calls where possible; otherwise cap them and use deterministic seeds.
- Configure a CI job (e.g., GitHub Actions) to run on every PR and on merge.
- Fail the job on:
  - Schema violations
  - Output drift above approved thresholds
  - Increased token cost beyond budget
- Store artifacts: diffs, samples, and run metrics.
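The runner and its failure conditions can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `Fixture`, `fake_model`, `evaluate`, and `run_suite` are hypothetical names, the thresholds are placeholders, "schema" is reduced to "valid JSON object with required keys", and drift is approximated with `difflib` string similarity against an approved baseline. A real setup would swap `fake_model` for the actual provider call and load fixtures from the versioned files.

```python
import json
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Fixture:
    """Hypothetical fixture shape: one test case from the corpus."""
    name: str
    prompt_input: str
    baseline_output: str          # previously approved output
    required_keys: tuple = ()     # keys the JSON output must contain


def fake_model(prompt_input):
    """Stand-in for a provider call (assumption). Returns (text, tokens)."""
    output = json.dumps({"summary": prompt_input.upper(), "ok": True})
    # Whitespace word count is only a rough proxy for real token usage.
    tokens = len(prompt_input.split()) + len(output.split())
    return output, tokens


def evaluate(fixture, call_model, max_drift=0.2, token_budget=50):
    """Return a list of failure messages for one fixture (empty = pass)."""
    failures = []
    output, tokens = call_model(fixture.prompt_input)

    # 1. Schema check: output must be a JSON object with the required keys.
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        failures.append("schema: output is not valid JSON")
    else:
        if not isinstance(parsed, dict):
            failures.append("schema: output is not a JSON object")
        else:
            missing = [k for k in fixture.required_keys if k not in parsed]
            if missing:
                failures.append(f"schema: missing keys {missing}")

    # 2. Drift check: 0.0 means identical to baseline, 1.0 means no overlap.
    similarity = SequenceMatcher(None, fixture.baseline_output, output).ratio()
    drift = 1.0 - similarity
    if drift > max_drift:
        failures.append(f"drift: {drift:.2f} exceeds {max_drift:.2f}")

    # 3. Cost check: token usage must stay within the budget.
    if tokens > token_budget:
        failures.append(f"cost: {tokens} tokens exceeds budget {token_budget}")

    return failures


def run_suite(fixtures, call_model):
    """Evaluate all fixtures; return 1 if any failed, else 0."""
    exit_code = 0
    for fixture in fixtures:
        for message in evaluate(fixture, call_model):
            print(f"FAIL {fixture.name}: {message}")
            exit_code = 1
    return exit_code


baseline, _ = fake_model("hello world")
FIXTURES = [Fixture("greeting", "hello world", baseline, ("summary", "ok"))]
exit_code = run_suite(FIXTURES, fake_model)
print("exit code:", exit_code)
```

In a CI job, the script would end with `sys.exit(exit_code)` so that any failure turns the build red, and the printed messages could be written to a file and uploaded as a run artifact.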
- Green build with stable metrics against baseline.
- Review artifacts and approve intentional changes.
- Flaky tests: tighten determinism (lower the temperature, fix seeds) or use a larger corpus so single outliers matter less.
- Cost spikes: shard tests or mark some as nightly.
- Provider 429s (rate limiting): implement retries with backoff.
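For the 429 case, one common approach is exponential backoff with jitter. The sketch below assumes a hypothetical `RateLimitError` that the provider client would raise on HTTP 429; real SDKs use their own exception types, and some return a `Retry-After` header that should take precedence over a computed delay. The function name and default values are illustrative only.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=30.0,
                      sleep=time.sleep):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    The delay ceiling doubles on each attempt (base_delay, 2x, 4x, ...),
    capped at max_delay. The actual sleep is drawn uniformly from
    [0, ceiling] ("full jitter") so parallel CI jobs do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the CI job fail visibly
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Usage: a fake call that is rate-limited twice before succeeding.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimitError("429 Too Many Requests")
    return "ok"


# sleep is stubbed out here so the example runs instantly.
result = call_with_backoff(flaky_call, sleep=lambda seconds: None)
print(result, "after", calls["count"], "calls")
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting; in CI the default `time.sleep` would be used.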
- Setup: 1–2 hours.
- Ongoing: minutes per PR; prevents regressions and incidents.