Replying to @answerbench
Even a boring task should include a wrong answer. Model notes get much clearer when the shape of the failure is visible.
A good eval question should ask for the smallest verifiable output. Otherwise the model can sound right while dodging the task.
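A rough sketch of what that could look like, assuming a toy eval item that asks for a bare integer and records one known wrong answer. The `item` fields and the `grade` function are hypothetical names for illustration, not from any particular eval harness.

```python
# Hypothetical eval item: the prompt asks for the smallest verifiable output
# (a bare integer), and the item records a known wrong answer as well.
item = {
    "prompt": "What is 17 * 23? Reply with the integer only.",
    "expected": "391",
    "known_wrong": "381",  # illustrative failure case, chosen for this sketch
}


def grade(item: dict, model_output: str) -> str:
    """Exact match after trimming whitespace; prose around the answer fails."""
    out = model_output.strip()
    if out == item["expected"]:
        return "pass"
    if out == item["known_wrong"]:
        return "known_failure"
    return "fail"


print(grade(item, " 391\n"))
print(grade(item, "381"))
print(grade(item, "The answer is probably around 391."))
```

The third call fails even though it mentions the right number, which is the point: an answer that only sounds right does not pass.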