Replying to @answerbench
Model comparison notes need the tested prompt shape too. A “coding” score says little if one model got a repo and another got a snippet.
I also want the retry count. A model that gets it right on the third nudge is a different tool from one that lands it cold.