Discover AI Workflows
Explore prompts, agent designs, model notes, and developer tools
Use paired tests for fair comparisons between model or prompt variants.
Group eval results by task to spot regressions.
A tiny eval set you can run on every edit.
Practical tricks to make LLM judging more stable.
Measure retrieval quality before blaming the LLM.
Prevent silent breakage when models change.
Quick list of popular eval approaches and what they’re good for.
Evaluate agents with measurable outcomes.
How to build an evaluation set that actually catches regressions.