[ MEASURE / TRAJECTORY_EVALUATION ]

What is AI agent trajectory evaluation?

[ TL;DR ]

Trajectory evaluation scores an AI agent against the actual sequence of decisions it took to reach an outcome — not just the outcome itself. Centurian’s Measure product runs continuous trajectory evaluation, doc-to-eval test generation, trajectory anomaly clustering, and regression detection across every registered agent. The engine catches agents that get the right answer for the wrong reason — the silent failure mode that pure outcome-checking and pure observability tools miss.

Outcome-checking is not enough

A freight-broker agent that quotes the right rate using the wrong base table is a liability waiting to fire. A claims-triage agent that resolves a ticket by hallucinating a refund authorization is a regulatory event. An autonomous payment agent that pays the right invoice from the wrong wallet is a FATF Travel Rule violation. Outcome-only eval gives all three a passing grade.

Centurian scores the path. Every trajectory is a sequence of (tool call, input, output, side effect) tuples. The eval engine combines deterministic checks (did the agent call the allowlisted tool? did the output respect the schema? did the side effect happen inside the policy envelope?) with LLM-as-judge (does the reasoning chain hold up against the framework rule?). Both layers run continuously, on every registered agent, against a signed eval corpus.
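
As a minimal sketch, the tuple and the deterministic layer could look like the following. Every name here (TrajectoryStep, the allowlist, the rate schema check) is an illustrative assumption, not Centurian's actual schema:

```python
from dataclasses import dataclass
from typing import Any

# One step of a trajectory: the (tool call, input, output, side effect)
# tuple described above. Field names are illustrative, not Centurian's schema.
@dataclass
class TrajectoryStep:
    tool: str                    # tool the agent called
    tool_input: dict[str, Any]
    tool_output: dict[str, Any]
    side_effect: str | None      # e.g. "quote.issued", or None for read-only steps

ALLOWLISTED_TOOLS = {"rate_lookup", "quote_builder"}   # hypothetical policy
POLICY_SIDE_EFFECTS = {"quote.issued"}                 # hypothetical envelope

def deterministic_checks(trajectory: list[TrajectoryStep]) -> list[str]:
    """Run the deterministic layer: allowlist, schema, policy envelope."""
    violations = []
    for i, step in enumerate(trajectory):
        if step.tool not in ALLOWLISTED_TOOLS:
            violations.append(f"step {i}: non-allowlisted tool {step.tool!r}")
        if "rate" in step.tool_output and not isinstance(step.tool_output["rate"], float):
            violations.append(f"step {i}: output violates schema (rate not a float)")
        if step.side_effect and step.side_effect not in POLICY_SIDE_EFFECTS:
            violations.append(f"step {i}: side effect {step.side_effect!r} outside envelope")
    return violations  # empty list: deterministic layer passed; judge layer runs next
```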

Four eval acquisition paths

[ PATH_01 / STARTER_TEMPLATES ]

8+ vertical-aware eval templates ship at launch (T&L EDI sequences, financial fraud, claims triage, customer support, RAG citation, code review, sales follow-up, marketing brief). Clone, configure, score in <60s.
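
A cloned template might reduce to a small config like this. Every field name is an assumption for illustration; 204, 990, and 214 are the standard EDI load-tender, tender-response, and shipment-status transaction sets:

```python
# Hypothetical shape of a cloned T&L starter template after configuration.
template = {
    "template": "tl_edi_sequences",      # one of the launch verticals
    "agent": "load-tender-bot",          # hypothetical registered agent id
    "overrides": {
        "edi_transactions": ["204", "990", "214"],   # tender, response, status
        "max_quote_deviation_pct": 2.0,              # tolerance before a fail
    },
}
```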

[ PATH_02 / PLAIN_ENGLISH_VIA_MCP ]

Declare the eval in plain English via the centurian.declare_eval MCP tool. Centurian’s LLM compiles intent into the deterministic + judge layers, runs against the signed corpus, signs and ships.
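
A plain-English declaration rides on an ordinary MCP tools/call request. Only the tool name centurian.declare_eval comes from this page; the argument names below are assumptions for the sketch:

```python
import json

# What an MCP client could send to invoke centurian.declare_eval.
# MCP tool calls ride on JSON-RPC 2.0; "agent" and "eval" are assumed names.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "centurian.declare_eval",
        "arguments": {
            "agent": "freight-broker-quoter",   # hypothetical agent id
            "eval": (
                "The agent must quote from the current base rate table, "
                "never invent accessorial charges, and cite the lane id "
                "in every quote."
            ),
        },
    },
}
print(json.dumps(request, indent=2))
```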

[ PATH_03 / ACTIVE_PROMPTS_AT_14_DAYS ]

If a registered agent has no eval at 14 days, Centurian generates draft tests from observed trajectories and prompts the owner to confirm. The agent gets quality coverage without anyone writing tests by hand.
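
A day-14 draft might look something like the following before the owner confirms it; the structure is an assumption for illustration:

```python
# A draft test derived from observed trajectories at day 14. The shape is
# illustrative; the owner confirms or edits before it joins the signed corpus.
draft_test = {
    "source": "observed_trajectories",   # auto-generated, unconfirmed
    "observed_runs": 412,                # hypothetical sample size
    "assertions": [
        {"type": "tool_allowlist", "tools": ["rate_lookup", "quote_builder"]},
        {"type": "output_schema", "field": "rate", "dtype": "float"},
        {"type": "side_effect_envelope", "allowed": ["quote.issued"]},
    ],
    "status": "awaiting_owner_confirmation",
}
```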

[ PATH_04 / PERIODIC_AUDIT_JOB ]

The Compliance Auditor role runs targeted, deep audits against the trajectory store. Ed25519-signed PDF + JSON output. Custom framework support via the Prove marketplace.
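
Consumers can verify the Ed25519 signature before trusting the JSON artifact. The verification below is standard PyNaCl; how Centurian distributes the publisher key and detaches the signature is an assumption here:

```python
import json
from nacl.signing import VerifyKey
from nacl.exceptions import BadSignatureError

def verify_audit(report_json: bytes, signature: bytes, publisher_key_hex: str) -> dict:
    """Verify an Ed25519-signed audit artifact before trusting its contents.

    The key distribution and detached-signature layout are assumptions;
    the Ed25519 verification itself is standard.
    """
    try:
        VerifyKey(bytes.fromhex(publisher_key_hex)).verify(report_json, signature)
    except BadSignatureError:
        raise SystemExit("audit report failed signature verification")
    return json.loads(report_json)
```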

vs. every observability tool

| Tool | Audience | Trajectory eval | Tied to compliance frameworks |
| --- | --- | --- | --- |
| Langfuse / Helicone / AgentOps | Software engineers | Session replays, not trajectory eval | No |
| Braintrust / Vellum / Galileo | ML/AI engineers | Outcome eval, batch-mode | No |
| Credo AI / Geordie | Compliance officers | No | Documentation, not enforcement |
| Salesforce Agent Fabric | SF admins | No | No |
| Centurian Measure | Operators + compliance + finance | Yes — continuous, on every registered agent | Yes — tied to Prove framework distribution |

Trajectory anomaly detection

Beyond pass/fail eval, Centurian clusters agent trajectories by (purpose, team, platform). The cluster is the agent’s normal shape. New tool calls, new recipient wallets, new latency profile, new failure mode — any drift outside the cluster surfaces as an anomaly. Combined with cost-spine deltas (token burn rate) and audit-spine deltas (new data access patterns), the agent gets a Red / Amber / Green posture per framework.
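
A stripped-down version of the drift gate, using the cluster key from above but with assumed thresholds and baseline structure:

```python
from statistics import mean, stdev

# Illustrative drift check against a cluster baseline. The (purpose, team,
# platform) key comes from the text; thresholds and baseline are assumptions.
baseline = {
    ("freight-quoting", "ops", "openai"): {
        "tools": {"rate_lookup", "quote_builder"},
        "latencies_ms": [820, 790, 905, 760, 840],
    },
}

def posture(cluster_key, tools_used: set[str], latency_ms: float) -> str:
    b = baseline[cluster_key]
    new_tools = tools_used - b["tools"]              # call outside the normal shape
    mu, sigma = mean(b["latencies_ms"]), stdev(b["latencies_ms"])
    latency_drift = abs(latency_ms - mu) > 3 * sigma  # crude z-score gate
    if new_tools:
        return "RED"     # new capability exercised: surface immediately
    if latency_drift:
        return "AMBER"   # shape drifted, no new capability exercised
    return "GREEN"

print(posture(("freight-quoting", "ops", "openai"), {"rate_lookup"}, 812))  # GREEN
```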

Industry benchmarks ($249/mo per industry tier) ship the median trajectory eval score across anonymized peer operators. Mid-market 3PLs benchmark freight-broker agents against the 3PL median. Financial services benchmark fraud-detection agents against peer firms. No identity leaves an org — but the percentile signal across thousands of registered agents is the kind of context that reframes a quarterly review.

FAQ

What is AI agent trajectory evaluation?

Trajectory evaluation scores an AI agent against the actual sequence of decisions it took to reach an outcome — not just the outcome itself. It catches agents that get the right answer for the wrong reason: ones that hallucinated and recovered, took non-deterministic paths through tools, or made unauthorized side calls. Centurian runs continuous trajectory eval across every registered agent using LLM-as-judge plus deterministic checks tied to a signed eval corpus.

What is doc-to-eval test generation?

Doc-to-eval converts an agent's documentation, runbook, or product spec into an executable evaluation suite. Point Centurian at a Confluence page, a README, or a plain-English description; the engine extracts intent, decision points, and acceptance criteria, then writes test cases the agent can be scored against. Four eval acquisition paths cover the rest: starter templates (8+ at launch), plain-English declaration via MCP, active prompts at 14 days post-registration, and a periodic audit job.

How does Centurian detect drifting or lying AI agents?

Trajectory anomaly detection clusters agent runs by (purpose, team, platform). When an agent's trajectory shifts outside its cluster — new tool calls, new recipients, new latency profile, new failure shape — Centurian flags it. Continuous regression detection watches for quality drops against the eval suite. Combined with the cost spine (token spike) and audit spine (new data access), drift becomes visible before it is incident-grade.

How is Centurian's Measure product different from Langfuse, Helicone, or AgentOps?

Langfuse, Helicone, and AgentOps are developer-focused observability tools — session replays, prompt management, latency monitoring, cost-per-LLM-call. They target software engineers debugging code. Centurian's Measure is policy-grade quality evaluation tied to the Govern and Prove products on one data spine. Operators (COOs, compliance officers, finance teams) see whether an agent passed regression on the EU AI Act framework, not whether prompt 27 had a 3.2-second p95.

What are industry benchmarks?

Centurian publishes anonymized cross-operator benchmarks per industry ($249/mo per industry tier). Mid-market 3PLs benchmark their freight-broker AI agents against the median 3PL. Mid-market financial services benchmark fraud-detection agents against peer firms. The benchmarks ship with quality, cost, and policy-violation percentiles. No agent identity, no operator name, no PII leaves an org — but the median trajectory eval score for a load-tendering agent in T&L is the kind of signal you cannot generate alone.

Does trajectory evaluation work for autonomous agents and customer-MCP agents?

Yes. The same trajectory eval engine scores Human-only operations, HITL-Chat operations, Centurian's Autonomous-Narrow agent, and customer-MCP agents (Claude Code, ChatGPT-with-MCP, Cursor). Every action is actor-tagged so regressions attribute correctly. The eval suite for Autonomous-Narrow is stricter than Human-only by design: a quality floor is part of the per-framework opt-in.
Get early access →

First agent free, forever · No credit card