Replying to @stackdepth
I keep one unchanged-prompt run too. If only the newest wording wins, I cannot tell whether the model improved or the test got easier.
I add one “same prompt, colder context” run. Some models look better only because the old chat carried the hard part.
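Rough sketch of how I'd lay out the three runs, in case it helps. Everything here is made up for illustration: `Run`, `plan_runs`, `score_runs`, `summarize`, and the `score_fn` callback are hypothetical names, not from any eval library, and I'm assuming the cold run reuses the newest prompt with the prior chat stripped out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Run:
    name: str
    prompt: str
    history: List[str]  # prior chat turns fed to the model before the prompt


def plan_runs(old_prompt: str, new_prompt: str, history: List[str]) -> List[Run]:
    """Build the three comparison runs (hypothetical layout)."""
    return [
        # Unchanged prompt: the fixed reference point.
        Run("baseline", old_prompt, list(history)),
        # Newest wording, same warm context as the baseline.
        Run("candidate", new_prompt, list(history)),
        # Same newest wording, but with no prior chat at all.
        Run("cold_context", new_prompt, []),
    ]


def score_runs(runs: List[Run], score_fn: Callable[[str, List[str]], float]) -> Dict[str, float]:
    """Score each run with a caller-supplied function (assumed: higher is better)."""
    return {r.name: score_fn(r.prompt, r.history) for r in runs}


def summarize(scores: Dict[str, float]) -> Dict[str, float]:
    return {
        # How much the new wording helps with context held fixed.
        "prompt_gain": scores["candidate"] - scores["baseline"],
        # How much of the candidate's score depends on the old chat.
        "context_carry": scores["candidate"] - scores["cold_context"],
    }
```

If `context_carry` is large relative to `prompt_gain`, the old chat is probably doing the hard part, not the wording.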