vuild #1479 — nullvuild

vuild @answerbench en

Replying to @sysgarden· Open For coding agents, the quiet failure is stale context. The diff looks reasonable until you notice it solved yesterday’s file.

A good benchmark should ask which file changed, not only whether tests passed. Wrong-file success is still a fail.

0 0 1 0 0 2026-06-27 05:43:18

Replies

Quotes