How to evaluate an AI agent

How to tell whether an AI agent actually works: defining success, building a test set, scoring the whole trajectory rather than just the answer, and evaluating continuously in production.

01 TL;DR

02 The starting point

Define success before you measure

Evaluation is impossible without a definition of correct. Before testing, write down what a successful run looks like for your agent: the right outcome, yes, but also acceptable behaviour along the way — did it stay within its authority, escalate when it should, and avoid harmful actions? Vague goals like "handle support well" cannot be scored; specific criteria like "resolves the top twenty repeatable questions correctly and routes the rest to the right queue" can. The clarity of your success criteria caps the quality of every evaluation that follows.
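One way to make criteria scoreable is to write them as explicit checks over a recorded run. The sketch below is illustrative only: the `RunRecord` fields, the `FORBIDDEN_ACTIONS` set, and the rule that a correct hand-off counts as a good outcome are all assumptions standing in for whatever your own criteria say.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunRecord:
    """What one agent run produced (all field names are hypothetical)."""
    resolved_correctly: bool           # the answer matched the expected one
    routed_queue: Optional[str]        # queue chosen if the agent handed off
    expected_queue: Optional[str]      # queue a reviewer says was correct
    actions: list[str] = field(default_factory=list)


# Hypothetical actions that are outside the agent's authority.
FORBIDDEN_ACTIONS = {"issue_refund_over_limit", "delete_account"}


def meets_success_criteria(run: RunRecord) -> dict[str, bool]:
    """Score one run against each written criterion separately."""
    # Outcome: either the agent resolved the case or it routed it correctly.
    outcome_ok = run.resolved_correctly or (
        run.routed_queue is not None
        and run.routed_queue == run.expected_queue
    )
    # Behaviour: no forbidden action appears anywhere in the run.
    within_authority = not any(a in FORBIDDEN_ACTIONS for a in run.actions)
    return {
        "outcome": outcome_ok,
        "within_authority": within_authority,
        "success": outcome_ok and within_authority,
    }
```

Keeping each criterion as its own key means a run with the right outcome but a forbidden action shows up as a behaviour failure rather than disappearing into a single pass or fail.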

Demos lie. A polished single run tells you the agent can succeed once, not that it succeeds reliably across the messy distribution of real cases. Reliability, not the highlight reel, is what evaluation measures.
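A short calculation shows why one good run says little about reliability. The sketch assumes runs are independent and share the same per-run success rate, which is a simplification; the function names are illustrative.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of recorded runs that passed."""
    if not results:
        raise ValueError("need at least one result")
    return sum(results) / len(results)


def chance_all_succeed(per_run_rate: float, k: int) -> float:
    """Probability that k independent runs all succeed at the given rate."""
    return per_run_rate ** k
```

Under those assumptions, an agent that succeeds on 90% of runs completes ten runs in a row without a failure only about 35% of the time (0.9 to the power of 10), which a single polished demo would never reveal.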

03 The method

Build a test set and score it

01 Assemble a test set of real, representative cases: the routine majority plus the awkward edge cases that break naive agents.
02 Run the agent against each case and capture the full trajectory: every step, tool call, and decision, not just the final output.
03 Score against your criteria: outcome correctness, plus whether the path was acceptable and within authority.
04 Analyse failures by category (bad retrieval, wrong tool, flawed plan, missed escalation) so fixes target the real cause.
05 Re-run after every change, so you can prove a fix helped and did not quietly break something else.
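The five steps above can be sketched as a small harness. Everything here is an assumption made for illustration: the agent is modelled as a function that returns a final answer plus its list of steps, only two failure categories are detected, and exact string matching stands in for whatever outcome check your criteria need.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Step:
    tool: str          # name of the tool the agent called
    detail: str = ""   # arguments or notes, kept for later review


@dataclass
class Case:
    case_id: str
    prompt: str
    expected: str                   # expected final answer
    allowed_tools: frozenset[str]   # tools within the agent's authority


@dataclass
class Result:
    case_id: str
    passed: bool
    failure_category: Optional[str]


# Hypothetical agent interface: prompt in, (answer, trajectory) out.
Agent = Callable[[str], tuple[str, list[Step]]]


def score(case: Case, answer: str, trajectory: list[Step]) -> Result:
    """Check the path first, then the outcome."""
    used = {step.tool for step in trajectory}
    if not used <= case.allowed_tools:
        return Result(case.case_id, False, "wrong_tool")
    if answer != case.expected:
        return Result(case.case_id, False, "wrong_outcome")
    return Result(case.case_id, True, None)


def run_suite(agent: Agent, cases: list[Case]) -> list[Result]:
    """Run every case and score the captured trajectory."""
    return [score(case, *agent(case.prompt)) for case in cases]


def failure_breakdown(results: list[Result]) -> Counter:
    """Count failures per category so fixes can target the real cause."""
    return Counter(r.failure_category for r in results if not r.passed)


def regressions(before: list[Result], after: list[Result]) -> list[str]:
    """Cases that passed before a change and fail after it."""
    passed_before = {r.case_id for r in before if r.passed}
    return sorted(
        r.case_id for r in after
        if not r.passed and r.case_id in passed_before
    )
```

Calling `run_suite` before and after a change and passing both result lists to `regressions` gives a direct answer to whether a fix quietly broke a case that used to pass.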

04 The methods

Who or what does the scoring

Scoring uses a mix of methods. Programmatic checks verify objective facts: did the agent post the right amount, hit the correct record, and stay under the authority limit? Human review judges quality and nuance that rules cannot capture. A separate model can also act as a grader at scale, checking outputs against a rubric. That is useful, but the grader itself must be validated against human judgement so you are not trusting one model to mark another unchecked.
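Validating a model grader can start with something as simple as comparing its pass/fail verdicts with human labels on the same runs. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. Choosing kappa is an assumption of this example, not a method the text prescribes, and the function names are illustrative.

```python
def agreement_rate(human: list[bool], grader: list[bool]) -> float:
    """Fraction of runs where the grader's verdict matches the human's."""
    if not human or len(human) != len(grader):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == g for h, g in zip(human, grader)) / len(human)


def cohens_kappa(human: list[bool], grader: list[bool]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    observed = agreement_rate(human, grader)
    n = len(human)
    p_human = sum(human) / n     # share of runs the human passed
    p_grader = sum(grader) / n   # share of runs the grader passed
    expected = p_human * p_grader + (1 - p_human) * (1 - p_grader)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, so this sketch reports it as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

Raw agreement can look high simply because most runs pass, so a chance-corrected figure is a useful second number before trusting the grader at scale. The runs where the two disagree are also the ones most worth sending back to human review.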

In the operators we run, the test set is not a launch artefact we file away. It runs continuously against the live agent, because an agent that scored well last month can degrade as the world it acts on shifts.

05 The discipline

Evaluation never ends at launch

The biggest mistake is treating evaluation as a gate you pass once. Agents face a moving target: data changes, tools change, the model may be updated, and behaviour drifts. Production evaluation — continuously sampling live runs, scoring them, and watching the metrics that signal drift — is what turns a good launch into a reliable system. An agent that is evaluated only before it ships is trusted on the strength of a snapshot, long after the snapshot stopped being true.
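A minimal sketch of production evaluation might sample a fraction of live runs for scoring and compare a rolling pass rate with the rate measured at launch. The sampling rate, window size, and tolerance below are placeholder values, and `sample_runs` and `DriftMonitor` are hypothetical names rather than part of any particular tool.

```python
import random
from collections import deque
from typing import Optional


def sample_runs(run_ids: list[str], rate: float,
                seed: Optional[int] = None) -> list[str]:
    """Keep each live run for scoring with probability `rate`."""
    rng = random.Random(seed)
    return [run_id for run_id in run_ids if rng.random() < rate]


class DriftMonitor:
    """Flags drift when the rolling pass rate falls below the baseline."""

    def __init__(self, baseline_pass_rate: float,
                 window: int = 200, tolerance: float = 0.05) -> None:
        self.baseline = baseline_pass_rate
        self.tolerance = tolerance
        self._scores: deque = deque(maxlen=window)  # most recent verdicts

    def record(self, passed: bool) -> None:
        """Add the verdict for one scored live run."""
        self._scores.append(passed)

    def rolling_pass_rate(self) -> Optional[float]:
        """Pass rate over the window, or None until the window is full."""
        if len(self._scores) < self._scores.maxlen:
            return None
        return sum(self._scores) / len(self._scores)

    def drifted(self) -> bool:
        """True once the rolling rate is below baseline minus tolerance."""
        rate = self.rolling_pass_rate()
        return rate is not None and rate < self.baseline - self.tolerance
```

Waiting for a full window before raising a flag avoids alerts from the first handful of runs, at the cost of a slower first signal. A real deployment would likely track several metrics, such as escalation rate or tool-error rate, in the same way.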

Questions

How do I evaluate an AI agent?
Define success criteria, build a representative test set including hard edge cases, run the agent and capture the full trajectory, score outcome and behaviour against your criteria, categorise failures, and re-run after every change. Then keep evaluating in production.

Why is a good demo not enough to evaluate an agent?
A demo shows the agent can succeed once. Evaluation measures whether it succeeds reliably across the messy distribution of real cases. A polished single run tells you nothing about edge cases or consistency, which is where agents usually fail.

Should I score the final answer or the whole trajectory?
The whole trajectory. An agent acts over many steps, so you check whether it chose the right actions, used tools correctly, stayed within its authority, and escalated when it should. A right outcome reached by a forbidden action is still a failure.

Can I use another model to evaluate an AI agent?
Yes, a model can grade outputs against a rubric at scale, which is useful. But validate that grader against human judgement so you are not trusting one model to mark another unchecked. Combine it with programmatic checks and human review.

Do I need to keep evaluating an agent after launch?
Yes. Data, tools, and the model can change, and behaviour drifts. Continuous production evaluation (sampling live runs, scoring them, and watching drift metrics) is what turns a good launch into a reliable system over time.
