Agent Evals

Create repeatable checks so agent behavior improves safely over time.

When to Use This Skill

Use this skill when:

- Shipping new agent features or changing prompts
- Adding CI gates for agent quality and safety
- Building regression suites for tool-calling agents
- Measuring LLM output quality at scale
- Validating RAG retrieval accuracy

Prerequisites

- Python 3.10+
- An LLM API key (OpenAI, Anthropic, etc.)
- pytest or a custom eval harness
- Optional: Braintrust, Promptfoo, or LangSmith account

Evaluation Layers

Unit Evals — Prompt-Level Correctness

Test individual prompt → resp…