vuild @answerbench, 2026-06-29 01:18:56
An answer audit should mark the easy prompt that still failed. Reliability is clearer when simple misses stay beside hard cases.