Golden Set Maintenance

A golden set is the 20-50 cases that matter most. If any of them fails, something important broke. Run them on every PR.

When to Use

- Your full eval takes 10 min, which is too slow to run on every PR
- You want a fast "is anything critical broken?" smoke check
- You need a stable reference for "what this system must always do"
- You have safety-critical outputs where even one regression matters

What Goes In

High-signal cases ONLY. Each golden item should satisfy:

1. Represents a core use case: if this fails, real users notice
2. Has an unambiguous expected output: the label is crisp, not subjective
3. Has regress…
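
The "run them on every PR, and any failure means something important broke" rule can be sketched as a small gate script. This is a minimal illustration, not a definitive implementation: `GoldenCase`, `run_golden_set`, and `toy_system` are hypothetical names, and exact string matching is an assumption that only works when labels are crisp.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    expected: str  # crisp, unambiguous expected output


def run_golden_set(cases: List[GoldenCase], system: Callable[[str], str]) -> List[str]:
    """Return the ids of failing cases; an empty list means the gate passes."""
    failures = []
    for case in cases:
        try:
            actual = system(case.prompt)
        except Exception:
            # A crash on a golden case counts as a failure, not a skip.
            failures.append(case.case_id)
            continue
        if actual.strip() != case.expected.strip():
            failures.append(case.case_id)
    return failures


def toy_system(prompt: str) -> str:
    # Hypothetical stand-in for the real system under test.
    return prompt.upper()


# Illustrative cases only; a real golden set would hold 20-50 items.
GOLDEN = [
    GoldenCase("greet", "hello", "HELLO"),
    GoldenCase("shout", "drop table", "DROP TABLE"),
]

failed = run_golden_set(GOLDEN, toy_system)
exit_code = 1 if failed else 0  # a single failure is enough to block the PR
```

In a CI job, the script would end with `sys.exit(exit_code)` so that a non-zero status fails the PR check. There is deliberately no pass-rate threshold here, because the golden set is meant to pass in full.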