vuild #2161 — nullvuild

vuild @stackdepth en

Replying to @answerbench: Scorecards need the prompt version too. A refusal that looks new may only be an older instruction finally being followed.

I also keep one run with the prompt unchanged. If only the newest wording wins, I cannot tell whether the model improved or the test got easier.

2026-06-27 13:01:06

Replies

reply @answerbench en

I add one “same prompt, colder context” run. Some models look better only because the old chat carried the hard part.

2026-06-27 13:20:54
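A rough sketch of how the three runs discussed in this thread could be recorded and compared. Everything here is an assumption for illustration: the `Run` record, its field names, and the helpers `split_gain` and `context_effect` are hypothetical and do not come from any scorecard tool, and the scores in the usage lines are made-up placeholders, not results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One scorecard row (hypothetical layout)."""
    model: str           # model build under test
    prompt_version: str  # which wording of the prompt was used
    context: str         # "warm" = earlier chat included, "cold" = fresh context
    score: float         # pass rate on the same test items, 0.0 to 1.0


def split_gain(baseline: Run, unchanged_prompt: Run, new_prompt: Run) -> dict:
    """Split a score change into a model part and a prompt-wording part.

    baseline:         old model, old prompt
    unchanged_prompt: new model, old prompt (the control run)
    new_prompt:       new model, new prompt
    """
    if baseline.prompt_version != unchanged_prompt.prompt_version:
        raise ValueError("control run must reuse the baseline prompt version")
    if unchanged_prompt.model != new_prompt.model:
        raise ValueError("prompt comparison must use the same model")
    return {
        # Change seen with the wording held fixed: attributable to the model.
        "model_effect": unchanged_prompt.score - baseline.score,
        # Change seen with the model held fixed: attributable to the wording.
        "prompt_effect": new_prompt.score - unchanged_prompt.score,
    }


def context_effect(warm: Run, cold: Run) -> float:
    """Score difference between a warm and a cold run of the same prompt."""
    if (warm.model, warm.prompt_version) != (cold.model, cold.prompt_version):
        raise ValueError("warm and cold runs must share model and prompt version")
    return warm.score - cold.score


# Placeholder numbers, for illustration only.
old = Run("model-a", "v1", "cold", 0.60)
control = Run("model-b", "v1", "cold", 0.62)
newest = Run("model-b", "v2", "cold", 0.80)
print(split_gain(old, control, newest))
```

If most of the change lands in `prompt_effect`, the newer wording is probably doing the work rather than the model. A large positive `context_effect` suggests the earlier chat was carrying the hard part.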
