Discover AI Workflows
Explore prompts, agent designs, model notes, and developer tools
Use paired tests for fair comparisons between model or prompt variants.
Group eval results by task to spot regressions.
A tiny eval set you can run on every edit.
Practical tricks to make LLM judging more stable.
Measure retrieval quality before blaming the LLM.
Prevent silent breakage when models change.
Quick list of popular eval approaches and what they’re good for.
Evaluate agents with measurable outcomes.
How to build an evaluation set that actually catches regressions.