Evaluation & Testing

The evaluation harness provides a provider-agnostic pipeline for running Claude Code agent swarms against a target repository, capturing every turn in a normalized schema, extracting structured findings, and (as a separate step) diffing two independent runs bidirectionally 1. The system is designed to assess model and agent performance by isolating runs by provider, materializing read-only snapshots of the target codebase, and orchestrating the execution of agent suites. Each run produces a structured set of artifacts including normalized telemetry, extracted findings, and parse reports, which are then indexed for later comparison.

The core orchestrator is the EvalRunner class, which manages a single-provider capture run 2. A single invocation corresponds to one provider (either local or anthropic) and one or more evaluation suites. The runner creates a timestamped run directory under ~/.local/share/autosre/eval-runs and writes a manifest.json that records metadata such as the target repository, SHA, snapshot digest, and suite outcomes 3.
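As a rough illustration of the run layout described above, the sketch below creates a timestamped run directory and writes a manifest.json into it. The helper names create_run_dir and write_manifest, the directory naming scheme, and the manifest keys are assumptions made for this example; the source only states that the directory is timestamped and which kinds of metadata the manifest records.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def create_run_dir(root: Path, provider: str) -> Path:
    """Create a timestamped run directory (the naming scheme is an assumption)."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    run_dir = root / f"{stamp}-{provider}"
    # exist_ok=False so two runs never silently share a directory.
    run_dir.mkdir(parents=True, exist_ok=False)
    return run_dir


def write_manifest(run_dir: Path, manifest: dict) -> Path:
    """Write manifest.json via a temp file so readers never see a half-written file."""
    path = run_dir / "manifest.json"
    tmp = run_dir / "manifest.json.tmp"
    tmp.write_text(json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8")
    tmp.replace(path)
    return path
```

In the real harness the root would be ~/.local/share/autosre/eval-runs; the sketch takes it as a parameter so it can be pointed at a scratch directory.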

The orchestration loop iterates through the requested suites. For each suite, it materializes a read-only snapshot of the target repository, launches the agent swarm, normalizes the telemetry, extracts findings, and cleans up the snapshot 2. If a crash occurs during a later suite, the runner ensures that results from previously completed suites are preserved by writing a partial manifest and appending a partial row to the global runs.jsonl index before re-raising the exception 3.
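The crash-preservation behaviour can be sketched as follows. This is a minimal illustration, not the EvalRunner implementation: run_suites, _persist, the run_one callback, and the "partial"/"complete" status strings are hypothetical names chosen for the example.

```python
import json
from pathlib import Path
from typing import Callable


def _persist(run_dir: Path, index_path: Path, manifest: dict) -> None:
    """Write the manifest and append one row to the global index file."""
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    row = {
        "run_dir": str(run_dir),
        "status": manifest["status"],
        "suites": sorted(manifest["suites"]),
    }
    with index_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(row) + "\n")


def run_suites(
    suites: list[str],
    run_one: Callable[[str], dict],
    run_dir: Path,
    index_path: Path,
) -> dict:
    manifest = {"status": "running", "suites": {}}
    try:
        for name in suites:
            # run_one stands in for snapshot -> launch -> normalize -> extract.
            manifest["suites"][name] = run_one(name)
    except Exception:
        # Keep what already finished, then let the caller see the crash.
        manifest["status"] = "partial"
        _persist(run_dir, index_path, manifest)
        raise
    manifest["status"] = "complete"
    _persist(run_dir, index_path, manifest)
    return manifest
```

Because the outcome of each suite is recorded as soon as it returns, a failure in a later suite only loses that suite's results.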

Each suite execution materializes a snapshot of the target repository into a worktree 4. The runner makes the snapshot read-only and removes the worktree even if a later step fails, unless the --keep-worktrees flag is set.

The runner launches the agent swarm using a SwarmLauncher instance, passing an EvalLaunchSpec that includes the capture directory, worktree path, prompt, system prompt, and configuration limits such as timeout and max tokens. The launcher captures the agent outputs, which are then processed to generate normalized turn records and extracted findings.
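The launch specification can be pictured as an immutable record. The field names and validation below are assumptions for illustration only; the source lists what EvalLaunchSpec carries but not its actual attribute names or types, so the class is deliberately given a different name.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class LaunchSpecSketch:
    """Hypothetical stand-in for EvalLaunchSpec; all field names are assumed."""

    capture_dir: Path    # where agent outputs and telemetry are written
    worktree: Path       # read-only snapshot the agents inspect
    prompt: str
    system_prompt: str
    timeout_s: float     # wall-clock limit for the whole swarm
    max_tokens: int      # token limit passed to the launcher

    def __post_init__(self) -> None:
        # Reject nonsensical limits before any agent is started.
        if self.timeout_s <= 0:
            raise ValueError("timeout_s must be positive")
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
```

Freezing the dataclass means a spec cannot be mutated after it is handed to the launcher, which keeps the recorded configuration and the executed configuration identical.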

Telemetry from the agent swarm is normalized into a standard TurnRecord schema and written to turns.jsonl 2. The normalize_run function handles provider-specific differences, ensuring that telemetry from both local and anthropic providers is stored in a consistent format 4. This normalized data is crucial for downstream analysis and debugging, as it provides a complete history of the agent’s interactions with the target repository.
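To make the idea of provider-specific normalization concrete, the sketch below maps two differently shaped raw records onto one schema and serializes them as JSON Lines. The record fields, the normalize_turn helper, and both raw input shapes are hypothetical; the actual TurnRecord schema and the behaviour of normalize_run are not shown in the source.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class TurnRecordSketch:
    """Assumed common schema; the real TurnRecord may carry other fields."""

    provider: str
    turn_index: int
    role: str
    text: str
    input_tokens: int
    output_tokens: int


def normalize_turn(provider: str, index: int, raw: dict[str, Any]) -> TurnRecordSketch:
    usage = raw.get("usage", {})
    if provider == "anthropic":
        # Assumed shape: content is a list of typed blocks.
        text = "".join(
            block.get("text", "")
            for block in raw.get("content", [])
            if block.get("type") == "text"
        )
        return TurnRecordSketch(
            provider, index, raw.get("role", "assistant"), text,
            usage.get("input_tokens", 0), usage.get("output_tokens", 0),
        )
    if provider == "local":
        # Assumed shape: a flat message with differently named token counters.
        message = raw.get("message", {})
        return TurnRecordSketch(
            provider, index, message.get("role", "assistant"),
            message.get("content", ""),
            usage.get("prompt_tokens", 0), usage.get("completion_tokens", 0),
        )
    raise ValueError(f"unknown provider: {provider}")


def to_jsonl(records: list[TurnRecordSketch]) -> str:
    """Serialize records one JSON object per line, as in turns.jsonl."""
    return "".join(json.dumps(asdict(r), sort_keys=True) + "\n" for r in records)
```

Once both providers produce the same record type, downstream tooling never needs to branch on the provider.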

Findings are extracted from the agent outputs using the extract_agent_findings function. The runner discovers expected agent output files based on the role_file_map defined in the suite configuration, and also picks up any additional JSON files written by the agents, excluding the lead file which is a summary and not a primary findings source.
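The discovery step might look roughly like the following. The discover_outputs helper, its signature, and the choice to key extra files by their filename stem are assumptions for this example; only the role_file_map concept and the exclusion of the lead file come from the text above.

```python
from pathlib import Path


def discover_outputs(
    capture_dir: Path,
    role_file_map: dict[str, str],
    lead_file: str,
) -> dict[str, Path]:
    """Map each source name to the JSON file its findings should come from."""
    # Expected files are listed even if missing, so a missing role can be
    # reported as a failure later instead of being silently skipped.
    found = {role: capture_dir / name for role, name in role_file_map.items()}
    known = {path.name for path in found.values()} | {lead_file}
    # Pick up any additional JSON files the agents wrote, except the lead summary.
    for path in sorted(capture_dir.glob("*.json")):
        if path.name not in known:
            found.setdefault(path.stem, path)
    return found
```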

For each agent role, the extraction pipeline processes the output file to identify structured findings. These findings are deduplicated by their unique id across all sources to prevent double-counting, especially in cases where tier-4 chat recovery returns the same blob for multiple missing roles. The extracted findings and an extraction report detailing the status of each agent (e.g., success, failed) are written to findings.jsonl and parse_report.json respectively.
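Deduplication by id across sources reduces to a first-seen-wins filter. The sketch below is an assumption about how this could work, not the body of extract_agent_findings; in particular, keeping findings that have no id at all is a choice made for this example.

```python
def dedupe_findings(by_source: dict[str, list[dict]]) -> list[dict]:
    """Merge findings from all sources, keeping the first occurrence of each id."""
    seen: set[str] = set()
    merged: list[dict] = []
    for source, findings in by_source.items():
        for finding in findings:
            finding_id = finding.get("id")
            if finding_id is not None:
                if finding_id in seen:
                    continue  # same blob recovered for another role
                seen.add(finding_id)
            # Record where the surviving copy came from.
            merged.append({**finding, "source": source})
    return merged
```

For example, if chat recovery returns the same two findings for both a missing "security" role and a missing "perf" role, the merged list contains each finding once, attributed to whichever role was processed first.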

Comparison between two independent runs (e.g., from different providers or models) is handled by a separate invocation, typically via autosre eval compare 2. The comparison logic involves bipartite matching of findings between the two runs and applies strict refusal rules to determine agreement 1.

The differ module is responsible for this bipartite matching, identifying findings that are present in both runs, only in one, or partially matched. The results of the comparison are rendered into a markdown report (compare.md) and a JSON payload, which include metrics such as agreement rates, counts of A-only, B-only, and both findings, and warnings if the comparison is not valid for model-quality assessment.
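A one-to-one matching between two finding lists can be sketched with a greedy pass over scored pairs. Everything specific here is an assumption: the similarity function (same file plus fuzzy title match), the 0.6 threshold, the greedy strategy, and the agreement formula are illustrative stand-ins, and the sketch omits partial matches and the refusal rules that the differ module applies.

```python
from difflib import SequenceMatcher


def similarity(a: dict, b: dict) -> float:
    """Assumed criterion: same file, then fuzzy title similarity in [0, 1]."""
    if a["file"] != b["file"]:
        return 0.0
    return SequenceMatcher(None, a["title"].lower(), b["title"].lower()).ratio()


def match_findings(run_a: list[dict], run_b: list[dict], threshold: float = 0.6) -> dict:
    # Score every cross-run pair, best first; ties broken by index for determinism.
    scored = sorted(
        ((similarity(a, b), i, j) for i, a in enumerate(run_a) for j, b in enumerate(run_b)),
        key=lambda t: (-t[0], t[1], t[2]),
    )
    used_a: set[int] = set()
    used_b: set[int] = set()
    both: list[tuple[int, int]] = []
    for score, i, j in scored:
        if score < threshold:
            break  # remaining pairs are all weaker
        if i in used_a or j in used_b:
            continue  # each finding may be matched at most once
        used_a.add(i)
        used_b.add(j)
        both.append((i, j))
    a_only = [i for i in range(len(run_a)) if i not in used_a]
    b_only = [j for j in range(len(run_b)) if j not in used_b]
    total = len(both) + len(a_only) + len(b_only)
    return {
        "both": both,
        "a_only": a_only,
        "b_only": b_only,
        # Matched pairs over all distinct findings; None when both runs are empty.
        "agreement": len(both) / total if total else None,
    }
```

Greedy matching is simple but not guaranteed to maximize the number of matched pairs; an optimal assignment algorithm would be the alternative if that matters.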

The harness provides rendering capabilities for both single-run and compare results 5. The render_single function generates a markdown report for a single run, summarizing the provider, target, SHA, total findings, parse failures, and per-suite details including extraction status and individual findings.
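A single-run report of this kind is mostly string assembly. The sketch below is a hypothetical rendering, not the output format of render_single: the function name, the manifest keys, and the heading layout are all assumed.

```python
def render_single_sketch(manifest: dict, findings_by_suite: dict[str, list[dict]]) -> str:
    """Build a markdown summary from an assumed manifest shape and per-suite findings."""
    total = sum(len(findings) for findings in findings_by_suite.values())
    lines = [
        f"# Eval run ({manifest['provider']})",
        "",
        f"- Target: {manifest['target']}",
        f"- SHA: {manifest['sha']}",
        f"- Total findings: {total}",
        f"- Parse failures: {manifest['parse_failures']}",
    ]
    for suite, findings in findings_by_suite.items():
        lines += ["", f"## {suite}", ""]
        if not findings:
            lines.append("No findings.")
        for finding in findings:
            lines.append(f"- {finding['id']}: {finding['title']}")
    return "\n".join(lines) + "\n"
```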

The render_compare function generates a markdown report for a comparison between two runs, highlighting the differences in findings, agreement rates, and any warnings about the validity of the comparison. These reports are written to report.md and compare.md respectively, providing a human-readable summary of the evaluation results, while the JSON files remain the source of truth for programmatic access.