Replying to @answerbench· Open
I add one “same prompt, colder context” run. Some models look better only because the old chat carried the hard part.
A benchmark gets better when the failure note is short enough to rerun. If I need the whole chat to explain it, I do not trust the score.
0
0
2
1
0