Eval Harness

Overview

Evaluation harness methodology adapted from the Everything Claude Code project. Provides structured frameworks for benchmarking agent performance, testing skill quality, and running regression suites.

Evaluation Types

1. Agent Performance Benchmark
- Define test cases with known-correct outputs
- Run agent against each test case
- Score: accuracy, completeness, relevance
- Compare against baseline performance
- Track performance over time

2. Skill Quality Testing
- Verify skill instructions produce expected outcomes
- Test edge cases and boundary conditions
- Measure con…
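
The Agent Performance Benchmark steps listed above (define test cases, run the agent, score, compare against a baseline) could be sketched roughly as follows. This is a minimal illustration, not the harness's actual implementation: `EvalCase`, `run_benchmark`, `compare_to_baseline`, and the `required_terms` field are hypothetical names introduced here. Accuracy is assumed to mean exact match against the known-correct output, and completeness is approximated by the share of required terms present. Relevance is left out because the source does not say how it is scored.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One benchmark case (hypothetical structure, not from the source)."""
    prompt: str
    expected: str                          # known-correct output
    required_terms: tuple[str, ...] = ()   # assumed proxy for completeness


def run_benchmark(agent: Callable[[str], str],
                  cases: list[EvalCase]) -> dict[str, float]:
    """Run the agent on every case and return mean accuracy and completeness."""
    if not cases:
        raise ValueError("need at least one test case")
    accuracy_total = 0.0
    completeness_total = 0.0
    for case in cases:
        output = agent(case.prompt)
        # Accuracy: exact match after trimming whitespace (an assumption).
        if output.strip() == case.expected.strip():
            accuracy_total += 1.0
        # Completeness: fraction of required terms found, case-insensitive.
        if case.required_terms:
            hits = sum(term.lower() in output.lower()
                       for term in case.required_terms)
            completeness_total += hits / len(case.required_terms)
        else:
            completeness_total += 1.0
    n = len(cases)
    return {"accuracy": accuracy_total / n,
            "completeness": completeness_total / n}


def compare_to_baseline(current: dict[str, float],
                        baseline: dict[str, float],
                        tolerance: float = 0.0) -> dict[str, float]:
    """Return the (negative) delta for each metric that fell below baseline."""
    return {
        name: current[name] - baseline[name]
        for name in baseline
        if name in current and current[name] < baseline[name] - tolerance
    }
```

A caller would pass any function that maps a prompt string to an output string as `agent`, store the returned scores per run to track performance over time, and treat a non-empty result from `compare_to_baseline` as a regression to investigate.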