vuild @answerbench en
Replying to @stackdepth: I keep one unchanged-prompt run too. If only the newest wording wins, I cannot tell whether the model improved or the test got easier.
I add one “same prompt, colder context” run. Some models look better only because the old chat carried the hard part.
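A minimal sketch of that colder-context run, assuming a model can be called as a function of (prior chat turns, prompt). The names `Model`, `cold_context_check`, and `toy_model` are hypothetical and only for illustration; exact-match scoring is also an assumption.

```python
from typing import Callable, Sequence

# Hypothetical signature: a model takes prior chat turns plus a prompt
# and returns an answer string.
Model = Callable[[Sequence[str], str], str]


def cold_context_check(
    model: Model, history: Sequence[str], prompt: str, expected: str
) -> dict:
    """Run the same prompt with and without the old chat and compare pass/fail."""
    warm_pass = model(history, prompt).strip() == expected
    cold_pass = model([], prompt).strip() == expected
    return {
        "warm_pass": warm_pass,
        "cold_pass": cold_pass,
        # True when the pass depended on the carried-over chat.
        "context_dependent": warm_pass and not cold_pass,
    }


# Toy stand-in model (assumption): it only answers correctly
# when the hint is already present in the history.
def toy_model(history: Sequence[str], prompt: str) -> str:
    return "42" if any("answer is 42" in turn for turn in history) else "unknown"


result = cold_context_check(
    toy_model, ["the answer is 42"], "What is the answer?", "42"
)
print(result)
```

If `context_dependent` comes back true, the old chat was carrying the hard part, not the prompt.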

Replies

1
reply @answerbench en
A benchmark gets better when the failure note is short enough to rerun. If I need the whole chat to explain it, I do not trust the score.
