
Running Evaluations

agentv eval evals/my-eval.yaml

Results are written to .agentv/results/eval_<timestamp>.jsonl. Each line is a JSON object with one result per test case.

Each scores[] entry includes per-judge timing:

{
  "scores": [
    {
      "name": "format_structure",
      "type": "llm-judge",
      "score": 0.9,
      "verdict": "pass",
      "hits": ["clear structure"],
      "misses": [],
      "duration_ms": 9103,
      "started_at": "2026-03-09T00:05:10.123Z",
      "ended_at": "2026-03-09T00:05:19.226Z",
      "token_usage": { "input": 2711, "output": 2535 }
    }
  ]
}

The duration_ms, started_at, and ended_at fields are present on every judge result (including code-judge), enabling per-judge bottleneck analysis.
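These timing fields lend themselves to quick aggregation with standard tools. A minimal sketch, assuming jq is installed (the results filename below is illustrative):

```shell
# Rank judges by duration, slowest first, across every line of a results file.
# Substitute your actual eval_<timestamp>.jsonl path.
jq -r '.scores[] | [.name, .duration_ms] | @tsv' .agentv/results/eval_latest.jsonl \
  | sort -k2 -rn \
  | head -5
```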

Run against a different target than the one specified in the eval file:

agentv eval --target azure-base evals/**/*.yaml

Run a single test by ID:

agentv eval --test-id case-123 evals/my-eval.yaml

Test the harness flow with mock responses (does not call real providers):

agentv eval --dry-run evals/my-eval.yaml

Write results to a custom output path with --out:

agentv eval evals/my-eval.yaml --out results/baseline.jsonl

Export execution traces (tool calls, timing, spans) to files for debugging and analysis:

# Human-readable JSONL trace (one record per test case)
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl
# OTLP JSON trace (importable by OTel backends like Jaeger, Grafana)
agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json
# Both formats simultaneously
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl --otel-file traces/eval.otlp.json

The --trace-file format writes JSONL records containing:

  • test_id - The test identifier
  • target / score - Target and evaluation score
  • duration_ms - Total execution duration
  • spans - Array of tool invocations with timing
  • token_usage / cost_usd - Resource consumption
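Because each record is a single JSON object, the trace can be summarized without extra tooling. A sketch using only the top-level fields listed above, assuming jq is installed:

```shell
# Total cost across a run, then the slowest test cases.
jq -s 'map(.cost_usd // 0) | add' traces/eval.jsonl
jq -r '[.test_id, .duration_ms] | @tsv' traces/eval.jsonl | sort -k2 -rn | head -3
```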

The --otel-file format writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend.

Use workspace mode and finish policies instead of multiple conflicting booleans:

# Mode: pooled | ephemeral | static
agentv eval evals/my-eval.yaml --workspace-mode pooled
# Static mode path
agentv eval evals/my-eval.yaml --workspace-mode static --workspace-path /path/to/workspace
# Pooled reset policy override: standard | full (CLI override)
agentv eval evals/my-eval.yaml --workspace-clean full
# Finish policy overrides: keep | cleanup (CLI)
agentv eval evals/my-eval.yaml --retain-on-success cleanup --retain-on-failure keep

Equivalent eval YAML:

workspace:
  mode: pooled        # pooled | ephemeral | static
  static_path: null   # required when mode=static
  hooks:
    after_each:
      reset: fast     # none | fast | strict

Notes:

  • Pooled mode is the default for shared workspaces with repos when mode is not specified.
  • mode: static (or --workspace-mode static) requires static_path / --workspace-path.
  • Static mode is incompatible with isolation: per_test.
  • Pool slots are managed separately (agentv workspace list|clean).

Re-run only the tests that had infrastructure/execution errors from a previous output:

agentv eval evals/my-eval.yaml --retry-errors .agentv/results/eval_previous.jsonl

This reads the previous JSONL, filters for executionStatus === 'execution_error', and re-runs only those test cases. Non-error results from the previous run are preserved and merged into the new output.
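To preview how many cases a retry would pick up, you can apply the same filter yourself. A sketch assuming jq is installed (the field name is taken from the description above):

```shell
# Count previous results that would be re-run by --retry-errors.
jq -s '[.[] | select(.executionStatus == "execution_error")] | length' \
  .agentv/results/eval_previous.jsonl
```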

Control whether the eval run halts on execution errors using execution.fail_on_error in the eval YAML:

execution:
  fail_on_error: false # never halt on errors (default)
  # fail_on_error: true # halt on first execution error
Value  Behavior
true   Halt immediately on first execution error
false  Continue despite errors (default)

When halted, remaining tests are recorded with failureReasonCode: 'error_threshold_exceeded'. With concurrency > 1, a few additional tests may complete before halting takes effect.

Check eval files for schema errors without executing:

agentv validate evals/my-eval.yaml

Run evaluations without API keys by letting an external agent (e.g., Claude Code, Copilot CLI) orchestrate the eval pipeline.

agentv prompt eval evals/my-eval.yaml

Outputs a step-by-step orchestration prompt listing all tests and the commands to run for each.

agentv prompt eval input evals/my-eval.yaml --test-id case-123

Returns JSON with:

  • input — array of {role, content} messages. File references use absolute paths ({type: "file", path: "/abs/path"}) that the agent can read directly from the filesystem.
  • guideline_paths — files containing additional instructions to prepend to the system message.
  • criteria — grading criteria for the orchestrator’s reference (do not pass to the candidate).
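An orchestrating agent can peel the candidate-facing messages out of this JSON directly. A sketch, assuming jq is installed (note that content may itself be structured when file references are present):

```shell
# Extract the user-role message content from the prompt JSON.
agentv prompt eval input evals/my-eval.yaml --test-id case-123 \
  | jq -r '.input[] | select(.role == "user") | .content'
```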
agentv prompt eval judge evals/my-eval.yaml --test-id case-123 --answer-file response.txt

Runs code judges deterministically and returns LLM judge prompts for the agent to execute. Each evaluator in the output has a status:

  • "completed" — Score is final (e.g., code judge). Read result.score.
  • "prompt_ready" — LLM grading required. Send prompt.system_prompt and prompt.user_prompt to your LLM and parse the JSON response.
Scenario                                   Command
Have API keys, want end-to-end automation  agentv eval
No API keys, agent can act as the LLM      agentv prompt

Declare the minimum AgentV version needed by your eval project in .agentv/config.yaml:

required_version: ">=2.12.0"

The value is a semver range using standard npm syntax (e.g., >=2.12.0, ^2.12.0, ~2.12, >=2.12.0 <3.0.0).

Condition                 Interactive (TTY)            Non-interactive (CI)
Version satisfies range   Runs silently                Runs silently
Version below range       Warns + prompts to continue  Warns to stderr, continues
--strict flag + mismatch  Warns + exits 1              Warns + exits 1
No required_version set   Runs silently                Runs silently
Malformed semver range    Error + exits 1              Error + exits 1

Use --strict in CI pipelines to enforce version requirements:

agentv eval --strict evals/my-eval.yaml

Set default execution options so you don’t have to pass them on every CLI invocation. Both .agentv/config.yaml and agentv.config.ts are supported.

execution:
  verbose: true
  trace_file: .agentv/results/trace-{timestamp}.jsonl
  keep_workspaces: false
  otel_file: .agentv/results/otel-{timestamp}.json
Field            CLI equivalent     Type     Default  Description
verbose          --verbose          boolean  false    Enable verbose logging
trace_file       --trace-file       string   none     Write human-readable trace JSONL
keep_workspaces  --keep-workspaces  boolean  false    Always keep temp workspaces after eval
otel_file        --otel-file        string   none     Write OTLP JSON trace to file
import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: {
    verbose: true,
    traceFile: '.agentv/results/trace-{timestamp}.jsonl',
    keepWorkspaces: false,
    otelFile: '.agentv/results/otel-{timestamp}.json',
  },
});

The {timestamp} placeholder is replaced with an ISO-like timestamp (e.g., 2026-03-05T14-30-00-000Z) at execution time.

Precedence: CLI flags > .agentv/config.yaml > agentv.config.ts > built-in defaults.

Override the default ~/.agentv directory for all global runtime data (workspaces, git cache, subagents, trace state, version check cache):

# Linux/macOS
export AGENTV_HOME=/data/agentv
# Windows (PowerShell)
$env:AGENTV_HOME = "D:\agentv"
# Windows (CMD)
set AGENTV_HOME=D:\agentv

When set, AgentV logs Using AGENTV_HOME: <path> on startup to confirm the override is active.

Run agentv eval --help for the full list of options including workers, timeouts, output formats, and trace dumping.