# Running Evaluations

## Run an Evaluation
```sh
agentv eval evals/my-eval.yaml
```

Results are written to `.agentv/results/eval_<timestamp>.jsonl`. Each line is a JSON object with one result per test case.
Each `scores[]` entry includes per-judge timing:

```json
{
  "scores": [
    {
      "name": "format_structure",
      "type": "llm-judge",
      "score": 0.9,
      "verdict": "pass",
      "hits": ["clear structure"],
      "misses": [],
      "duration_ms": 9103,
      "started_at": "2026-03-09T00:05:10.123Z",
      "ended_at": "2026-03-09T00:05:19.226Z",
      "token_usage": { "input": 2711, "output": 2535 }
    }
  ]
}
```

The `duration_ms`, `started_at`, and `ended_at` fields are present on every judge result (including code judges), enabling per-judge bottleneck analysis.
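As a quick sketch of that bottleneck analysis, the snippet below ranks judges by `duration_ms` across a results JSONL file. It assumes only the result shape shown above; it is not part of agentv itself.

```python
import json

def slowest_judges(results_path, top_n=3):
    """Rank judges by duration_ms across all test cases in a results JSONL."""
    timings = []
    with open(results_path) as f:
        for line in f:
            record = json.loads(line)
            for score in record.get("scores", []):
                timings.append((score["duration_ms"], score["name"]))
    timings.sort(reverse=True)
    return timings[:top_n]
```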
## Common Options

### Override Target

Run against a different target than the one specified in the eval file:
```sh
agentv eval --target azure-base evals/**/*.yaml
```

### Run Specific Test

Run a single test by ID:

```sh
agentv eval --test-id case-123 evals/my-eval.yaml
```

### Dry Run
Test the harness flow with mock responses (does not call real providers):
```sh
agentv eval --dry-run evals/my-eval.yaml
```

### Output to Specific File

```sh
agentv eval evals/my-eval.yaml --out results/baseline.jsonl
```

## Trace Persistence
Section titled “Trace Persistence”Export execution traces (tool calls, timing, spans) to files for debugging and analysis:
```sh
# Human-readable JSONL trace (one record per test case)
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl

# OTLP JSON trace (importable by OTel backends like Jaeger, Grafana)
agentv eval evals/my-eval.yaml --otel-file traces/eval.otlp.json

# Both formats simultaneously
agentv eval evals/my-eval.yaml --trace-file traces/eval.jsonl --otel-file traces/eval.otlp.json
```

The `--trace-file` format writes JSONL records containing:

- `test_id`: the test identifier
- `target` / `score`: target and evaluation score
- `duration_ms`: total execution duration
- `spans`: array of tool invocations with timing
- `token_usage` / `cost_usd`: resource consumption

The `--otel-file` format writes standard OTLP JSON that can be imported into any OpenTelemetry-compatible backend.
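To illustrate working with a `--trace-file`, here is a minimal sketch that sums durations, span counts, and cost across all records. It assumes only the field names listed above.

```python
import json

def trace_totals(trace_path):
    """Sum duration_ms, span counts, and cost_usd across a trace JSONL."""
    total_ms, total_spans, total_cost = 0, 0, 0.0
    with open(trace_path) as f:
        for line in f:
            record = json.loads(line)
            total_ms += record.get("duration_ms", 0)
            total_spans += len(record.get("spans", []))
            total_cost += record.get("cost_usd", 0.0)
    return total_ms, total_spans, total_cost
```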
## Workspace Modes and Finish Policy

Use workspace mode and finish policies instead of multiple conflicting booleans:

```sh
# Mode: pooled | ephemeral | static
agentv eval evals/my-eval.yaml --workspace-mode pooled

# Static mode path
agentv eval evals/my-eval.yaml --workspace-mode static --workspace-path /path/to/workspace

# Pooled reset policy override: standard | full (CLI override)
agentv eval evals/my-eval.yaml --workspace-clean full

# Finish policy overrides: keep | cleanup (CLI)
agentv eval evals/my-eval.yaml --retain-on-success cleanup --retain-on-failure keep
```

Equivalent eval YAML:

```yaml
workspace:
  mode: pooled        # pooled | ephemeral | static
  static_path: null   # required when mode=static
  hooks:
    after_each:
      reset: fast     # none | fast | strict
```

Notes:
- Pooling is the default for shared workspaces with repos when mode is not specified.
- `mode: static` (or `--workspace-mode static`) requires `static_path` / `--workspace-path`.
- Static mode is incompatible with `isolation: per_test`.
- Pool slots are managed separately (`agentv workspace list|clean`).
## Retry Execution Errors

Re-run only the tests that had infrastructure/execution errors in a previous output:

```sh
agentv eval evals/my-eval.yaml --retry-errors .agentv/results/eval_previous.jsonl
```

This reads the previous JSONL, filters for `executionStatus === 'execution_error'`, and re-runs only those test cases. Non-error results from the previous run are preserved and merged into the new output.
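The selection step can be pictured with a small sketch that mirrors the behavior described above: split prior results into error cases (to re-run) and everything else (to keep). This is an illustration, not agentv's implementation.

```python
import json

def split_by_execution_status(results_path):
    """Separate prior results into error cases and completed cases."""
    rerun, keep = [], []
    with open(results_path) as f:
        for line in f:
            record = json.loads(line)
            if record.get("executionStatus") == "execution_error":
                rerun.append(record)
            else:
                keep.append(record)
    return rerun, keep
```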
## Execution Error Tolerance

Control whether the eval run halts on execution errors using `execution.fail_on_error` in the eval YAML:

```yaml
execution:
  fail_on_error: false   # never halt on errors (default)
  # fail_on_error: true  # halt on first execution error
```

| Value | Behavior |
|---|---|
| `true` | Halt immediately on the first execution error |
| `false` | Continue despite errors (default) |
When halted, remaining tests are recorded with `failureReasonCode: 'error_threshold_exceeded'`. With concurrency > 1, a few additional tests may complete before halting takes effect.
## Validate Before Running

Check eval files for schema errors without executing:

```sh
agentv validate evals/my-eval.yaml
```

## Agent-Orchestrated Evals
Run evaluations without API keys by letting an external agent (e.g., Claude Code, Copilot CLI) orchestrate the eval pipeline.
### Overview

```sh
agentv prompt eval evals/my-eval.yaml
```

Outputs a step-by-step orchestration prompt listing all tests and the commands to run for each.
### Get Task Input

```sh
agentv prompt eval input evals/my-eval.yaml --test-id case-123
```

Returns JSON with:

- `input`: an `[{role, content}]` array. File references use absolute paths (`{type: "file", path: "/abs/path"}`) that the agent can read directly from the filesystem.
- `guideline_paths`: files containing additional instructions to prepend to the system message.
- `criteria`: grading criteria for the orchestrator's reference (do not pass to the candidate).
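An orchestrating agent typically needs to inline those file references before sending the messages onward. The sketch below shows one way to do that; the exact part schema beyond `{type: "file", path}` (e.g., a `text` field on non-file parts) is an assumption.

```python
from pathlib import Path

def resolve_content(content):
    """Inline {"type": "file", "path": ...} parts by reading from disk.
    Accepts either a plain string or a list of part objects."""
    if isinstance(content, str):
        return content
    resolved = []
    for part in content:
        if part.get("type") == "file":
            resolved.append(Path(part["path"]).read_text())
        else:
            resolved.append(part.get("text", ""))
    return "\n".join(resolved)
```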
### Judge the Result

```sh
agentv prompt eval judge evals/my-eval.yaml --test-id case-123 --answer-file response.txt
```

Runs code judges deterministically and returns LLM judge prompts for the agent to execute. Each evaluator in the output has a status:

- `"completed"`: the score is final (e.g., a code judge). Read `result.score`.
- `"prompt_ready"`: LLM grading is required. Send `prompt.system_prompt` and `prompt.user_prompt` to your LLM and parse the JSON response.
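An orchestrator's dispatch over those two statuses might look like the sketch below. The `name` field on each evaluator and the `score` key in the LLM's JSON verdict are assumptions; `run_llm` is a caller-supplied function standing in for whatever model the agent uses.

```python
def collect_scores(evaluators, run_llm):
    """Resolve each evaluator by status: take completed scores as final,
    and call run_llm(system, user) for prompt_ready entries, expecting
    a parsed JSON verdict in return."""
    scores = {}
    for ev in evaluators:
        if ev["status"] == "completed":
            scores[ev["name"]] = ev["result"]["score"]
        elif ev["status"] == "prompt_ready":
            verdict = run_llm(ev["prompt"]["system_prompt"],
                              ev["prompt"]["user_prompt"])
            scores[ev["name"]] = verdict["score"]
    return scores
```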
### When to Use

| Scenario | Command |
|---|---|
| Have API keys, want end-to-end automation | `agentv eval` |
| No API keys, agent can act as the LLM | `agentv prompt` |
## Version Requirements

Declare the minimum AgentV version needed by your eval project in `.agentv/config.yaml`:

```yaml
required_version: ">=2.12.0"
```

The value is a semver range using standard npm syntax (e.g., `>=2.12.0`, `^2.12.0`, `~2.12`, `>=2.12.0 <3.0.0`).
| Condition | Interactive (TTY) | Non-interactive (CI) |
|---|---|---|
| Version satisfies range | Runs silently | Runs silently |
| Version below range | Warns + prompts to continue | Warns to stderr, continues |
| `--strict` flag + mismatch | Warns + exits 1 | Warns + exits 1 |
| No `required_version` set | Runs silently | Runs silently |
| Malformed semver range | Error + exits 1 | Error + exits 1 |
Use `--strict` in CI pipelines to enforce version requirements:

```sh
agentv eval --strict evals/my-eval.yaml
```

## Config File Defaults

Set default execution options so you don't have to pass them on every CLI invocation. Both `.agentv/config.yaml` and `agentv.config.ts` are supported.
### YAML config (`.agentv/config.yaml`)

```yaml
execution:
  verbose: true
  trace_file: .agentv/results/trace-{timestamp}.jsonl
  keep_workspaces: false
  otel_file: .agentv/results/otel-{timestamp}.json
```

| Field | CLI equivalent | Type | Default | Description |
|---|---|---|---|---|
| `verbose` | `--verbose` | boolean | `false` | Enable verbose logging |
| `trace_file` | `--trace-file` | string | none | Write human-readable trace JSONL |
| `keep_workspaces` | `--keep-workspaces` | boolean | `false` | Always keep temp workspaces after eval |
| `otel_file` | `--otel-file` | string | none | Write OTLP JSON trace to file |
### TypeScript config (`agentv.config.ts`)

```ts
import { defineConfig } from '@agentv/core';

export default defineConfig({
  execution: {
    verbose: true,
    traceFile: '.agentv/results/trace-{timestamp}.jsonl',
    keepWorkspaces: false,
    otelFile: '.agentv/results/otel-{timestamp}.json',
  },
});
```

The `{timestamp}` placeholder is replaced with an ISO-like timestamp (e.g., `2026-03-05T14-30-00-000Z`) at execution time.
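A substitution of that shape can be sketched as follows; this mimics the example format above (colons and the fractional dot replaced by dashes) and is not agentv's actual code.

```python
from datetime import datetime, timezone

def expand_timestamp(template):
    """Substitute {timestamp} with a filesystem-safe UTC stamp like
    2026-03-05T14-30-00-000Z."""
    now = datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H-%M-%S-") + f"{now.microsecond // 1000:03d}Z"
    return template.replace("{timestamp}", stamp)
```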
Precedence: CLI flags > `.agentv/config.yaml` > `agentv.config.ts` > built-in defaults.
## Environment Variables

### AGENTV_HOME

Override the default `~/.agentv` directory for all global runtime data (workspaces, git cache, subagents, trace state, version check cache):

```sh
# Linux/macOS
export AGENTV_HOME=/data/agentv

# Windows (PowerShell)
$env:AGENTV_HOME = "D:\agentv"

# Windows (CMD)
set AGENTV_HOME=D:\agentv
```

When set, AgentV logs `Using AGENTV_HOME: <path>` on startup to confirm the override is active.
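The resolution rule is simple to state in code; this sketch shows the environment-variable fallback pattern described above, not agentv's internals.

```python
import os
from pathlib import Path

def agentv_home():
    """Resolve the global data directory: AGENTV_HOME when set, else ~/.agentv."""
    override = os.environ.get("AGENTV_HOME")
    return Path(override) if override else Path.home() / ".agentv"
```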
## All Options

Run `agentv eval --help` for the full list of options, including workers, timeouts, output formats, and trace dumping.