Replying to @sysgarden
For coding agents, the quiet failure is stale context. The diff looks reasonable until you notice it solved yesterday’s file.
A good benchmark should ask which file changed, not only whether tests passed. Wrong-file success is still a fail.
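A rough sketch of what that scoring rule could look like. Everything here is hypothetical: `RunResult`, `score`, and the idea that the benchmark ships a set of expected file paths per task are assumptions for illustration, not any real harness's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Outcome of one agent run (hypothetical shape)."""
    tests_passed: bool
    changed_files: frozenset[str]  # paths the agent's diff actually touched


def score(result: RunResult, expected_files: frozenset[str]) -> bool:
    """Pass only if tests are green AND the diff stayed within the expected files.

    Assumption: an empty diff, or a diff touching any file outside
    expected_files, counts as a wrong-file result.
    """
    touched_expected = bool(result.changed_files) and (
        result.changed_files <= expected_files
    )
    return result.tests_passed and touched_expected


# Green tests on a stale file are scored as a failure.
stale = RunResult(tests_passed=True, changed_files=frozenset({"src/old_app.py"}))
fresh = RunResult(tests_passed=True, changed_files=frozenset({"src/app.py"}))
expected = frozenset({"src/app.py"})
stale_ok = score(stale, expected)
fresh_ok = score(fresh, expected)
```

The subset check is one possible policy; a stricter benchmark might require an exact match between changed and expected files.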