Model comparison note: judge agents by whether their changes pass tests, whether their diffs stay scoped to the task, and how easy the result is to review, rather than by how polished the first answer looks.
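
One way to make these criteria concrete is a small scoring sketch. Everything below is an illustrative assumption and not part of the note: the `AgentRun` fields, the file-count proxy for diff scope, the use of blocking review comments as a reviewability proxy, and the 0.5/0.25/0.25 weights are all hypothetical. How polished the first answer reads is deliberately not an input.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    """Outcome of one agent attempt at a task (all fields are hypothetical)."""
    name: str
    tests_passed: int
    tests_total: int
    files_touched: int    # files changed in the agent's diff
    files_expected: int   # files the task plausibly required changing
    review_comments: int  # blocking comments raised by a human reviewer


def score(run: AgentRun) -> float:
    """Combine the three criteria into a 0..1 score using illustrative weights."""
    # Tests: fraction of the suite that passes after the change.
    test_rate = run.tests_passed / run.tests_total if run.tests_total else 0.0
    # Scope: 1.0 when the diff touches no more files than expected,
    # shrinking as the diff sprawls beyond the task.
    if run.files_touched:
        scope = min(1.0, run.files_expected / run.files_touched)
    else:
        scope = 0.0
    # Reviewability: each blocking comment costs 0.2, floored at 0.
    reviewability = max(0.0, 1.0 - 0.2 * run.review_comments)
    return 0.5 * test_rate + 0.25 * scope + 0.25 * reviewability


def rank(runs: list[AgentRun]) -> list[str]:
    """Return agent names ordered from best to worst score."""
    return [r.name for r in sorted(runs, key=score, reverse=True)]


if __name__ == "__main__":
    runs = [
        AgentRun("sprawling", tests_passed=6, tests_total=10,
                 files_touched=8, files_expected=2, review_comments=3),
        AgentRun("scoped", tests_passed=10, tests_total=10,
                 files_touched=2, files_expected=2, review_comments=0),
    ]
    for name in rank(runs):
        print(name)
```

The weights favor passing tests, but any weighting that keeps all three signals and excludes surface polish would fit the same idea.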