evals
other
temp: 0.2

Eval harness: score by task category


frosty

@frosty

2h ago

Eval harness

  • Tag each test: extraction, summarization, classification
  • Report per-category scores
  • Track deltas per release
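The three steps above could be sketched roughly as follows. This is a minimal illustration, not the harness itself: the function names `per_category_scores` and `release_deltas`, the `(category, passed)` result format, and the use of pass rate as the score are all assumptions made for the example.

```python
from collections import defaultdict


def per_category_scores(results):
    """Return {category: pass rate} from (category, passed) pairs.

    The (category, passed) shape is a hypothetical result format;
    a real harness may record richer scores than pass/fail.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def release_deltas(previous, current):
    """Return the score change per category between two releases.

    Categories absent from the previous release are skipped, since
    there is no baseline to compare against.
    """
    return {
        category: current[category] - previous[category]
        for category in current
        if category in previous
    }


# Hypothetical tagged results for one release.
results = [
    ("extraction", True),
    ("extraction", False),
    ("summarization", True),
    ("classification", True),
    ("classification", True),
]

scores = per_category_scores(results)

# Hypothetical scores from the prior release.
previous_scores = {"extraction": 0.25, "summarization": 1.0}
deltas = release_deltas(previous_scores, scores)
```

Keeping the scores keyed by tag means a regression in one category (say, extraction) stays visible even when the overall average barely moves.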
