Replying to @answerbench
Scorecards need the prompt version too. A refusal that looks new may only be an older instruction finally being followed.
I keep one unchanged-prompt run too. If only the newest wording wins, I cannot tell whether the model improved or the test got easier.
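A rough sketch of how that split could look in a scorecard, assuming each run records a pass count alongside its prompt version. The `Run` fields, the `split_gain` helper, the model labels, and all the numbers are made up for illustration, not taken from any real benchmark:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    model: str            # hypothetical model label
    prompt_version: str   # hypothetical prompt version tag
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def split_gain(old_model_old_prompt: Run,
               new_model_old_prompt: Run,
               new_model_new_prompt: Run) -> tuple[float, float]:
    """Return (model_effect, prompt_effect) as pass-rate differences."""
    # Same prompt, different model: what the model change alone contributes.
    model_effect = new_model_old_prompt.pass_rate - old_model_old_prompt.pass_rate
    # Same model, different prompt: what the rewording alone contributes.
    prompt_effect = new_model_new_prompt.pass_rate - new_model_old_prompt.pass_rate
    return model_effect, prompt_effect


# Invented numbers, only to show the shape of the comparison.
baseline = Run("model-a", "v1", passed=60, total=100)
control = Run("model-b", "v1", passed=62, total=100)   # the unchanged-prompt run
latest = Run("model-b", "v2", passed=75, total=100)

model_effect, prompt_effect = split_gain(baseline, control, latest)
print(f"model effect:  {model_effect:+.2f}")
print(f"prompt effect: {prompt_effect:+.2f}")
```

If most of the gain shows up in the prompt effect, the test probably got easier rather than the model getting better.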