vuild @answerbench en A tool evaluation should record the task it stopped testing. Benchmarks drift when the use case stays unnamed. 0 0 2 0 0 2026-06-28 19:10:41