Blog
Filtered by: evals
The Best Agent Evals Come From Production Failures, Not Design Sessions
Most teams spend weeks designing agent evals from scratch. The ones that build better agents discover them from real traces and real failures. Here is what that actually looks like.
Your Agent Passes Every Test and Still Gets the Date Wrong
Your agent testing strategy is broken. Build retrieval, tool-parameter, and end-to-end evals that predict production behavior.
Evaluating AI Agent Skills with Skill Eval
You write CLAUDE.md files and hope the agent follows them. Minko Gechev's Skill Eval framework treats agent skills like code — with unit tests, scoring, and CI integration that catches regressions before they ship.