Evaluation & Testing

The evaluation harness provides a provider-agnostic pipeline for running Claude Code agent swarms against a target repository, capturing every turn in a normalized schema, extracting structured findings, and (as a separate step) diffing two independent runs bidirectionally ¹. The system is designed to assess model and agent performance by isolating runs by provider, materializing read-only snapshots of the target codebase, and orchestrating the execution of agent suites. Each run produces a structured set of artifacts including normalized telemetry, extracted findings, and parse reports, which are then indexed for later comparison.

Run Orchestration

The core orchestrator is the EvalRunner class, which manages a single-provider capture run ². A single invocation corresponds to one provider (either local or anthropic) and one or more evaluation suites. The runner creates a timestamped run directory under ~/.local/share/autosre/eval-runs and writes a manifest.json that records metadata such as the target repository, SHA, snapshot digest, and suite outcomes ³.

The orchestration loop iterates through the requested suites. For each suite, it materializes a read-only snapshot of the target repository, launches the agent swarm, normalizes the telemetry, extracts findings, and cleans up the snapshot ². If a crash occurs during a later suite, the runner ensures that results from previously completed suites are preserved by writing a partial manifest and appending a partial row to the global runs.jsonl index before re-raising the exception ³.

Suite Execution and Snapshot Management

Each suite execution involves materializing a snapshot of the target repository into a worktree ⁴. The snapshot is enforced to be read-only, and the runner ensures cleanup of the worktree even if subsequent steps fail, unless the --keep-worktrees flag is used.

The runner launches the agent swarm using a SwarmLauncher instance, passing an EvalLaunchSpec that includes the capture directory, worktree path, prompt, system prompt, and configuration limits such as timeout and max tokens. The launcher captures the agent outputs, which are then processed to generate normalized turn records and extracted findings.

Telemetry Normalization

Telemetry from the agent swarm is normalized into a standard TurnRecord schema and written to turns.jsonl ². The normalize_run function handles provider-specific differences, ensuring that telemetry from both local and anthropic providers is stored in a consistent format ⁴. This normalized data is crucial for downstream analysis and debugging, as it provides a complete history of the agent’s interactions with the target repository.

Findings Extraction

Findings are extracted from the agent outputs using the extract_agent_findings function. The runner discovers expected agent output files based on the role_file_map defined in the suite configuration, and also picks up any additional JSON files written by the agents, excluding the lead file which is a summary and not a primary findings source.

For each agent role, the extraction pipeline processes the output file to identify structured findings. These findings are deduplicated by their unique id across all sources to prevent double-counting, especially in cases where tier-4 chat recovery returns the same blob for multiple missing roles. The extracted findings and an extraction report detailing the status of each agent (e.g., success, failed) are written to findings.jsonl and parse_report.json respectively.

Comparison and Diffing

Comparison between two independent runs (e.g., from different providers or models) is handled by a separate invocation, typically via autosre eval compare ². The comparison logic involves bipartite matching of findings between the two runs and applies strict refusal rules to determine agreement ¹.

The differ module is responsible for this bipartite matching, identifying findings that are present in both runs, only in one, or partially matched. The results of the comparison are rendered into a markdown report (compare.md) and a JSON payload, which include metrics such as agreement rates, counts of A-only, B-only, and both findings, and warnings if the comparison is not valid for model-quality assessment .

Reporting

The harness provides rendering capabilities for both single-run and compare results ⁵. The render_single function generates a markdown report for a single run, summarizing the provider, target, SHA, total findings, parse failures, and per-suite details including extraction status and individual findings.

The render_compare function generates a markdown report for a comparison between two runs, highlighting the differences in findings, agreement rates, and any warnings about the validity of the comparison . These reports are written to report.md and compare.md respectively, providing a human-readable summary of the evaluation results, while the JSON files remain the source of truth for programmatic access.

autosre/eval/__init__.py L1-23

"""Auto-SRE eval harness.

This package provides a provider-agnostic pipeline for running Claude Code
agent swarms against a target repository, capturing every turn in a
normalized schema, extracting structured findings, and (as a separate step)
diffing two independent runs bidirectionally.

Subpackages and modules:

- :mod:`autosre.eval.snapshot`  - immutable, filesystem-enforced snapshots
- :mod:`autosre.eval.capture`   - stream-json + proxy log → ``TurnRecord``
- :mod:`autosre.eval.findings`  - ``Finding`` schema + id + normalization
- :mod:`autosre.eval.suite`     - YAML suite loader + ``EvalSuite`` schema
- :mod:`autosre.eval.runner`    - orchestrates a single-provider run
- :mod:`autosre.eval.extract`   - findings extraction pipeline with fallbacks
- :mod:`autosre.eval.differ`    - bipartite matching + strict refusal rules
- :mod:`autosre.eval.judge`     - Opus LLM-as-judge subprocess wrapper
- :mod:`autosre.eval.report`    - markdown / json report rendering
- :mod:`autosre.eval.lenient_json` - tiny JSON cleaner (no new deps)
"""

from __future__ import annotations

autosre/eval/runner.py L1-120 (showing 40 of 120)

"""EvalRunner - single-provider capture orchestration.

One invocation = one provider, N suites. The runner:

1. Creates a timestamped run directory.
2. Writes ``manifest.json``.
3. For each suite: materializes a read-only snapshot, launches Claude
   Code via ``SwarmLauncher`` in eval mode, normalizes telemetry into
   ``turns.jsonl``, extracts findings into ``findings.jsonl``, writes a
   ``parse_report.json``, and cleans up the snapshot.
4. Appends one row to ``runs.jsonl``.

Runs are independent. Running the other provider is a separate
invocation. Diffing them is a third separate invocation
(``autosre eval compare``).
"""

from __future__ import annotations

import json
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import TYPE_CHECKING, Literal, Protocol

from autosre.eval.capture import TurnRecord, normalize_run, write_turns
from autosre.eval.extract import (
    AgentExtraction,
    ExtractionReport,
    extract_agent_findings,
    write_findings_jsonl,
    write_parse_report,
)
from autosre.eval.snapshot import Snapshot, cleanup, materialize
from autosre.eval.suite import EvalSuite, load_all_suites
from autosre.swarm.launcher import (
    DEFAULT_ANTHROPIC_MODEL,
    CaptureResult,
    EvalLaunchSpec,
)

autosre/eval/runner.py L121-240 (showing 40 of 120)

        *,
        provider: RunProvider,
        suites: list[str],
        target: Path,
        run_id: str | None = None,
        anthropic_model: str = DEFAULT_ANTHROPIC_MODEL,
        allow_dirty: bool = False,
        keep_worktrees: bool = False,
    ) -> RunResult:
        target = target.resolve()
        all_suites = load_all_suites()
        unknown = [s for s in suites if s not in all_suites]
        if unknown:
            raise ValueError(f"unknown suites: {unknown}")

        run_dir, final_run_id = self._make_run_dir(provider, run_id)
        run_dir.mkdir(parents=True, exist_ok=True)
        (run_dir / "suites").mkdir()
        (run_dir / "worktrees").mkdir()

        # First suite materializes to infer SHA/digest for the manifest.
        suite_results: list[SuiteRunResult] = []
        run_target_sha: str | None = None
        run_snapshot_digest: str | None = None

        def build_result() -> RunResult:
            return RunResult(
                run_id=final_run_id,
                run_dir=run_dir,
                provider=provider,
                target_repo=target,
                target_sha=run_target_sha,
                snapshot_digest=run_snapshot_digest,
                suites=suite_results,
            )

        try:
            for suite_name in suites:
                self._run_one_suite(
                    provider=provider,

autosre/eval/runner.py L241-360 (showing 40 of 120)

                proxy_log_path=self.proxy_log_path,
                out_path=turns_path,
            )

            merged_findings, report = self._extract_suite_findings(
                suite=suite,
                agent_outputs=agent_outputs,
                turns_path=turns_path,
                provider=provider,
            )

            write_findings_jsonl(merged_findings, suite_dir / "findings.jsonl")
            write_parse_report(report, suite_dir / "parse_report.json")

            suite_results.append(
                SuiteRunResult(
                    suite=suite_name,
                    findings=merged_findings,
                    report=report,
                    turns=turns,
                    capture=capture,
                    snapshot=snapshot,
                )
            )
        finally:
            if not keep_worktrees:
                cleanup(snapshot)

    # ── Internals ─────────────────────────────────────────────────

    def _make_run_dir(
        self,
        provider: RunProvider,
        run_id: str | None,
    ) -> tuple[Path, str]:
        ts = time.strftime("%Y-%m-%dT%H-%M-%S", time.gmtime(self.clock()))
        tag = run_id or "run"
        final = f"{ts}-{provider}-{tag}"
        return self.runs_root / final, final

autosre/eval/report.py L1-120 (showing 40 of 120)

"""Markdown rendering for eval runs and compares.

Deliberately boring: just enough markdown to let a human scan a
``report.md`` or ``compare.md`` without opening JSON. The source of
truth is always the JSON next to the markdown - the markdown is a
courtesy.
"""

from __future__ import annotations

import json
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from pathlib import Path

    from autosre.eval.differ import CompareResult
    from autosre.eval.runner import RunResult

# ── Single-run report ─────────────────────────────────────────────


def render_single(result: RunResult) -> str:
    """Render the per-run ``report.md`` for one capture run."""
    lines: list[str] = []
    lines.append(f"# Eval run: {result.run_id}")
    lines.append("")
    lines.append(f"- Provider: **{result.provider}**")
    lines.append(f"- Target: `{result.target_repo}`")
    lines.append(f"- SHA: `{result.target_sha or 'n/a'}`")
    lines.append(f"- Snapshot digest: `{result.snapshot_digest or 'n/a'}`")
    lines.append("")

    total_findings = sum(len(sr.findings) for sr in result.suites)
    total_failures = sum(
        1 for sr in result.suites for a in sr.report.agents if a.status == "failed"
    )
    lines.append(f"**Findings:** {total_findings}")
    lines.append(f"**Parse failures:** {total_failures}")
    lines.append("")