devtu-benchmark-harness

Benchmark Harness: Continuous Improvement System

A 5-step feedback loop for improving the quality of ToolUniverse tools, skills, and plugins.

Note: This skill is dataset-agnostic. Per-benchmark score history, known-failing question IDs, and dataset-specific investigations belong in the gitignored workfolder, NOT in this skill directory.

The Feedback Loop

Orchestrated runner (preferred)

One command does steps 0 (memorization audit), 1 (build), 2 (run), 3 (analyze), and 4 (diagnose + extract failures). The script writes results.json, analysis.log, diagnose.log, and failures.json. Diagnose output lists ea…
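
To make the shape of the loop concrete, here is a minimal sketch of how steps 0 through 4 might be chained and how the four artifacts might be written. It is not the harness's actual runner: the step names, the `run_loop` and `demo_steps` helpers, the shared state dictionary, and the record fields (`id`, `correct`) are all assumptions made for illustration. Only the artifact filenames come from the text above.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical step names mirroring steps 0-4 described above.
STEP_NAMES = ["memorization_audit", "build", "run", "analyze", "diagnose"]

Step = Callable[[dict], dict]


def run_loop(steps: dict[str, Step], out_dir: Path) -> dict:
    """Run the five steps in order, then write the four artifacts to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    state: dict = {}
    for name in STEP_NAMES:
        # Each step reads the shared state and returns new keys to merge in.
        # An exception raised by a step aborts the remaining steps.
        state.update(steps[name](state))
    (out_dir / "results.json").write_text(json.dumps(state.get("results", []), indent=2))
    (out_dir / "analysis.log").write_text(state.get("analysis", ""))
    (out_dir / "diagnose.log").write_text(state.get("diagnosis", ""))
    (out_dir / "failures.json").write_text(json.dumps(state.get("failures", []), indent=2))
    return state


def demo_steps() -> dict[str, Step]:
    """Placeholder steps with made-up data, standing in for the real work."""

    def analyze(state: dict) -> dict:
        results = state["results"]
        passed = sum(1 for r in results if r["correct"])
        return {"analysis": f"accuracy: {passed}/{len(results)}\n"}

    def diagnose(state: dict) -> dict:
        failures = [r for r in state["results"] if not r["correct"]]
        lines = "".join(f"FAIL {r['id']}\n" for r in failures)
        return {"failures": failures, "diagnosis": lines}

    return {
        "memorization_audit": lambda state: {"audit_ok": True},
        "build": lambda state: {"built": True},
        "run": lambda state: {
            "results": [
                {"id": "q1", "correct": True},
                {"id": "q2", "correct": False},
            ]
        },
        "analyze": analyze,
        "diagnose": diagnose,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        final_state = run_loop(demo_steps(), Path(tmp) / "demo_run")
        print(final_state["analysis"], end="")
        print(final_state["diagnosis"], end="")
```

Passing a single state dictionary through the steps keeps each stage independently replaceable, and writing `failures.json` separately from `results.json` lets the next iteration of the loop focus on failing items only.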