# Eval Harness

## Overview

A systematic framework for evaluating agent performance. Measures accuracy, efficiency, and reliability across defined test scenarios. Enables data-driven decisions about agent quality and improvement.

## When to Use

- Before deploying agent changes to production
- Comparing different agent configurations
- Identifying weaknesses in agent behavior
- Tracking agent quality over time
- Validating prompt improvements

## Evaluation Dimensions

### 1. Accuracy

Does the agent produce correct outputs?

| Metric | Measurement | Target |
|--------|------------|--------|
| Task completion | %…