AI Tool Evaluation Log Checklist

An AI tool evaluation log checklist helps a user compare models without relying only on memory, hype, or one impressive answer.

Start with the task label. Name the job clearly: summarize a long policy, debug a failing API request, write a Korean product note, refactor a React component, create a travel plan, compare two spreadsheets, or draft a customer reply. Avoid vague labels such as “reasoning test” unless the task has a concrete output.

Next, record the conditions. Include the model or tool name, the mode, the temperature if visible, the date tested, whether web browsing was allowed, whether files or terminal access were available, and whether the tool saw the same context as its competitors. A chat-only model and a coding agent with repository access are not the same condition.
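As a minimal sketch, the conditions above could be kept as one small record per run, so that two runs can be checked for mismatched capabilities before their results are compared. The class name `Conditions`, its field names, the helper `condition_mismatches`, the tool names, and the date are all illustrative assumptions, not part of any particular tool.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Conditions:
    """Test conditions for one run. Field names are illustrative."""
    tool: str
    mode: str
    date_tested: date
    temperature: Optional[float] = None  # None when the tool does not show it
    web_browsing: bool = False
    file_access: bool = False
    terminal_access: bool = False
    shared_context: bool = True  # saw the same context as competitors


def condition_mismatches(a: Conditions, b: Conditions) -> list[str]:
    """Return the capability fields on which two runs differ."""
    capability_fields = ("web_browsing", "file_access", "terminal_access", "shared_context")
    return [name for name in capability_fields if getattr(a, name) != getattr(b, name)]


# Hypothetical runs: placeholder tool names and a placeholder date.
chat_only = Conditions(tool="model-a", mode="chat", date_tested=date(2024, 1, 15))
coding_agent = Conditions(
    tool="agent-b",
    mode="agent",
    date_tested=date(2024, 1, 15),
    file_access=True,
    terminal_access=True,
)

mismatches = condition_mismatches(chat_only, coding_agent)
print(mismatches)
```

A non-empty result is a signal that the two runs were not tested under the same condition and should not be scored side by side without a note.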

Then define pass criteria before reading the answer. Criteria can include factual accuracy, structure, missing caveats, ability to follow format, low hallucination, useful questions, edit quality, command safety, speed, and whether the result could be used with light revision. If the criteria are written after seeing the answer, the evaluation becomes preference scoring rather than workflow testing.
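One way to hold criteria fixed before the answer is read is to write them down as an immutable tuple and refuse to score anything outside it. The sketch below is illustrative only: the function name `score_against_criteria` and the criterion names are assumptions chosen for the example.

```python
from __future__ import annotations


def score_against_criteria(predeclared: tuple[str, ...], results: dict[str, bool]) -> float:
    """Return the fraction of predeclared criteria that passed.

    Raises ValueError if no criteria were declared, if results contain a
    criterion that was not declared up front, or if a declared criterion
    was left unjudged.
    """
    if not predeclared:
        raise ValueError("declare at least one criterion before scoring")
    declared = set(predeclared)
    added_later = set(results) - declared
    if added_later:
        raise ValueError(f"criteria added after the fact: {sorted(added_later)}")
    unjudged = declared - set(results)
    if unjudged:
        raise ValueError(f"criteria not judged: {sorted(unjudged)}")
    return sum(results[name] for name in predeclared) / len(predeclared)


# Hypothetical criteria, written down before the answer is read.
CRITERIA = (
    "factual_accuracy",
    "follows_format",
    "command_safety",
    "usable_with_light_revision",
)

pass_rate = score_against_criteria(
    CRITERIA,
    {
        "factual_accuracy": True,
        "follows_format": True,
        "command_safety": True,
        "usable_with_light_revision": False,
    },
)
print(pass_rate)
```

Rejecting late additions is a deliberate choice here: a criterion invented after reading the answer is exactly what turns the evaluation into preference scoring.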

Add failure notes. The most useful part of the log is often where a tool failed: it invented a source, ignored a constraint, over-compressed Korean, missed an edge case, broke code style, asked for unnecessary permission, or produced a plausible but unusable answer. Failure notes help decide when to switch tools during real work.

Finally, record the switching decision. The result is one of four: default tool, specialist tool, backup only, or avoid for this task. The evaluation log should lead to routing rules, not abstract arguments about which model is best.
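To show how logged decisions can turn into routing rules, here is a rough sketch. The names `Decision`, `LogEntry`, and `routing_rules`, the tool names, and the preference order (default, then specialist, then backup) are assumptions made for the example, not a prescribed format.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    """The four switching decisions; lower value means more preferred (assumed order)."""
    DEFAULT = 0
    SPECIALIST = 1
    BACKUP = 2
    AVOID = 3


@dataclass
class LogEntry:
    task_label: str
    tool: str
    decision: Decision
    failure_notes: list[str] = field(default_factory=list)


def routing_rules(entries: list[LogEntry]) -> dict[str, list[str]]:
    """Map each task label to its usable tools, most preferred first."""
    by_task: dict[str, list[LogEntry]] = {}
    for entry in entries:
        # Tools marked "avoid for this task" never appear in the routing rules.
        if entry.decision is not Decision.AVOID:
            by_task.setdefault(entry.task_label, []).append(entry)
    return {
        task: [e.tool for e in sorted(task_entries, key=lambda e: e.decision.value)]
        for task, task_entries in by_task.items()
    }


# Hypothetical log entries with placeholder tool names.
log = [
    LogEntry("debug a failing API request", "tool-b", Decision.BACKUP),
    LogEntry("debug a failing API request", "tool-a", Decision.DEFAULT),
    LogEntry("debug a failing API request", "tool-c", Decision.AVOID, ["invented source"]),
    LogEntry("write a Korean product note", "tool-b", Decision.SPECIALIST),
]

rules = routing_rules(log)
print(rules)
```

The output is keyed by task label rather than by tool, which matches the point above: the log answers "which tool for this job", not "which model is best".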
