Regression Evals Every prompt change, model upgrade, or tool tweak is a potential regression. Regression evals catch breakage before users do — if you gate deploys on them. When to Use - Any AI feature in production - Before upgrading model versions - Before merging prompt changes - Weekly as a drift/decay check The Core Loop Minimal Harness Significance Testing A 2% drop on 100 items is noise, not signal. Use bootstrap confidence intervals: For pass/fail metrics, use McNemar's test on the paired outcomes. For scalars, paired bootstrap. Thresholds Hard rules: | Metric | Regression threshold |…