Source Indexing

The Source Indexing subsystem transforms the Python source tree into a structured, queryable SQLite database to provide deterministic grounding for the autoswe planner. The pipeline operates in four distinct phases: a deterministic AST scan extracts file metadata and symbols; a local LLM generates tiered summaries for code context; a synthesis step aggregates these summaries into module-level overviews; and a final mapping phase resolves Go modules to their Python counterparts using a strict hierarchy of overrides, rules, and fuzzy matching. This architecture ensures that the planner receives instant, complete context without needing to traverse the file system or rely on external API calls during the rewrite loop.

Deterministic Source Scanning

The initial phase, handled by scripts/source_index/scanner.py, performs a content-hash-aware walk of the Python source tree to extract structural metadata without invoking any models ¹. The scanner identifies all .py files, skipping __pycache__ directories, and computes a SHA-256 hash of each file’s contents to detect changes. Using Python’s standard ast module, it parses the source code to extract classes (including bases, decorators, and methods), top-level functions, top-level constants, and imports.
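The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the scanner's actual code: the function names sha256_text and sketch_extract_symbols, the shape of the symbol dictionaries, and the "uppercase name means constant" heuristic are all assumptions made for the example. It needs Python 3.9 or later for ast.unparse.

```python
import ast
import hashlib


def sha256_text(src: str) -> str:
    """Content hash used for change detection (hashes text here for brevity)."""
    return hashlib.sha256(src.encode("utf-8")).hexdigest()


def sketch_extract_symbols(src: str) -> tuple[list[dict], list[str]]:
    """Hypothetical stand-in: return (symbols, imports) for one source file."""
    symbols: list[dict] = []
    imports: list[str] = []
    try:
        tree = ast.parse(src)
    except SyntaxError:
        # Assumption: an unparsable file simply contributes no symbols.
        return symbols, imports
    for node in tree.body:  # top-level statements only
        if isinstance(node, ast.ClassDef):
            symbols.append({
                "kind": "class",
                "name": node.name,
                "bases": [ast.unparse(b) for b in node.bases],
                "decorators": [ast.unparse(d) for d in node.decorator_list],
                "methods": [
                    n.name for n in node.body
                    if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
                ],
            })
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols.append({"kind": "function", "name": node.name})
        elif isinstance(node, ast.Assign):
            # Assumed heuristic: an UPPER_CASE target is a constant.
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id.isupper():
                    symbols.append({"kind": "constant", "name": target.id})
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.append(node.module or ".")
    return symbols, imports
```

Because the walk only inspects tree.body, nested functions and classes are ignored, which keeps the symbol list small and the result fully reproducible for a given file.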

The scanner maintains a source_files table in the SQLite database, storing file paths, sizes, line counts, and JSON-serialized symbol lists. It employs a generation-based update strategy: if a file’s content hash is unchanged from the previous generation and the file already has a summary, the scanner preserves the existing tier1 results to avoid redundant processing. Files with changed content or no prior summary are marked with summary_state='pending'. This phase is entirely deterministic and model-free, ensuring stable inputs for subsequent steps.
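One way to picture the generation-based update is the sketch below. The table layout (SCHEMA), the column names sha256, generation, and tier1_summary, and the helper upsert_file are assumptions for illustration; the real source_files table has more columns and may structure the update differently.

```python
import sqlite3

# Assumed, simplified schema; the real table also stores sizes, line counts, symbols.
SCHEMA = """
CREATE TABLE IF NOT EXISTS source_files (
    path TEXT PRIMARY KEY,
    sha256 TEXT NOT NULL,
    generation INTEGER NOT NULL,
    tier1_summary TEXT,
    summary_state TEXT NOT NULL
)
"""


def upsert_file(con: sqlite3.Connection, path: str, sha: str, gen: int) -> str:
    """Record one scanned file and return its resulting summary_state."""
    row = con.execute(
        "SELECT sha256, summary_state FROM source_files WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == sha and row[1] == "ok":
        # Same content and a finished summary: keep tier1, only bump the generation.
        con.execute(
            "UPDATE source_files SET generation = ? WHERE path = ?", (gen, path)
        )
        return "ok"
    # New or changed file: drop any stale summary and queue it for summarization.
    con.execute(
        "INSERT OR REPLACE INTO source_files "
        "(path, sha256, generation, tier1_summary, summary_state) "
        "VALUES (?, ?, ?, NULL, 'pending')",
        (path, sha, gen),
    )
    return "pending"
```

Keying the decision on the content hash rather than on modification time means a file that is touched but not edited does not trigger a new summary.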

Local LLM Summarization

Once the AST scan populates the database with pending files, the system invokes a local LLM to generate contextual summaries. This process is managed by scripts/source_index/summarize.py and is triggered by the scan command unless the --deterministic flag is set ². The --deterministic mode, also implied by the AUTOSWE_RUN_CONFIG environment variable, skips LLM inference entirely, relying solely on AST symbols and Go documentation for grounding.
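The control flow of a scan can be sketched as below. The flag-or-environment check mirrors the excerpt from cli.py, but run_phases, its callable parameters, and the shape of the report dictionary are hypothetical and exist only to make the ordering of the phases explicit.

```python
import os
from typing import Any, Callable, Mapping

SKIPPED = {"skipped": "deterministic-only grounding (no model)"}


def is_deterministic(flag: bool, environ: Mapping[str, str] = os.environ) -> bool:
    """True when --deterministic is set or AUTOSWE_RUN_CONFIG is non-empty."""
    return bool(flag) or bool(environ.get("AUTOSWE_RUN_CONFIG"))


def run_phases(
    flag: bool,
    scan: Callable[[], Any],
    summarize: Callable[[], Any],
    synthesize: Callable[[], Any],
    remap: Callable[[], Any],
    environ: Mapping[str, str] = os.environ,
) -> dict:
    """Hypothetical driver: A always runs, B/C only with a model, D always runs."""
    report = {"A": scan()}
    if is_deterministic(flag, environ):
        report["B"] = report["C"] = SKIPPED
    else:
        report["B"] = summarize()
        report["C"] = synthesize()
    report["D"] = remap()
    return report
```

Passing the phases in as callables keeps the sketch independent of the real modules; in deterministic mode the summarize and synthesize callables are never invoked, so no model endpoint is needed.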

When LLM inference is active, the system generates “tier1” summaries for individual source files and “tier2” summaries for module directories. These summaries are stored in the source_files and source_modules tables respectively, with their summary_state updated to 'ok' upon successful generation. Because the model (Qwen3.6) runs locally, this step does not depend on external API endpoints and the source code stays on the local machine ³.

Module Synthesis

The synthesis phase, implemented in scripts/source_index/modules.py, aggregates file-level information into module-level overviews ². This step produces “tier2” summaries, which provide a higher-level context for directories containing Python files. The synthesis process queries the database for files with summary_state='ok' and combines their tier1 summaries to form a cohesive module description.

This aggregated context is crucial for the mapping phase, as it provides the planner with a broader understanding of module responsibilities and interactions. The synthesized data is stored in the source_modules table, linked to the current generation. By pre-computing these summaries, the system avoids the need for the planner to infer module structure from raw code during planning iterations.
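The aggregation step, before any model is involved, amounts to grouping finished tier1 summaries by directory. The sketch below shows only that grouping; the function name group_tier1_by_module and the (path, summary_state, tier1_summary) row shape are assumptions, and the real synthesis then passes the combined text to the local model to produce the tier2 summary.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_tier1_by_module(rows) -> dict[str, str]:
    """Group finished tier1 summaries by parent directory.

    rows: iterable of (path, summary_state, tier1_summary) tuples.
    Returns {module_dir: combined text}, one "file: summary" line per file.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for path, state, summary in rows:
        if state != "ok" or not summary:
            continue  # only finished summaries feed the module overview
        p = PurePosixPath(path)
        grouped[str(p.parent)].append(f"{p.name}: {summary}")
    # Sort lines so the combined text is stable across runs.
    return {d: "\n".join(sorted(lines)) for d, lines in grouped.items()}
```

For example, two summarized files under tools/ yield a single "tools" entry with two lines, while a file still in the pending state is left out.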

Go-to-Python Module Mapping

The final phase, handled by scripts/source_index/mapping.py, resolves each Go module to its corresponding Python source files ⁴. The planner relies on this mapping to know which Python code grounds a given Go module. The resolution follows a strict priority order: manual overrides, then deterministic rules, then fuzzy matching.

Manual overrides are loaded from acceptance/module-map.yaml and take precedence over all other methods. If no override exists, the system applies deterministic rules, such as mapping internal/adapters/<name> to tools/<name>.py or internal/app/<name> to commands/<name>.py. Rule-based mappings carry high confidence scores (0.9 to 0.95) and are labeled with the source “rule”.
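The first two levels of the priority order can be sketched as follows. The two path patterns come from the text above; everything else is an assumption for illustration, including the resolve function, the result dictionary, which rule gets 0.95 versus 0.9, the confidence of 1.0 for overrides, and the override format (Go module mapped to a list of Python paths).

```python
import re

# (pattern, target template, confidence). The confidences are illustrative:
# the text only states that rules score between 0.9 and 0.95.
RULES = [
    (re.compile(r"^internal/adapters/(?P<name>[^/]+)$"), "tools/{name}.py", 0.95),
    (re.compile(r"^internal/app/(?P<name>[^/]+)$"), "commands/{name}.py", 0.9),
]


def resolve(go_module: str, overrides: dict):
    """Return a mapping result, or None if fuzzy matching would be needed."""
    if go_module in overrides:
        # Manual overrides win unconditionally (confidence 1.0 is assumed).
        return {
            "files": list(overrides[go_module]),
            "confidence": 1.0,
            "source": "override",
        }
    for pattern, template, confidence in RULES:
        match = pattern.match(go_module)
        if match:
            return {
                "files": [template.format(name=match.group("name"))],
                "confidence": confidence,
                "source": "rule",
            }
    return None  # not covered by an override or a rule
```

Checking overrides before rules means an operator can always correct a rule that maps a module to the wrong file, without touching the rule table itself.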

For modules not covered by rules, the system employs a fuzzy matching algorithm. This algorithm scores source files based on name similarity in paths, symbols, and tier1 summaries. If the top candidate’s score exceeds the confidence threshold (0.75) and leads the second candidate by more than the margin (0.15), the mapping is marked as confident. Otherwise, the module is marked as “ambiguous,” and its candidates are exposed so that the planner confirms the match instead of trusting a single, possibly wrong file. This approach prevents incorrect mappings from propagating into the planning phase.
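The threshold-and-margin decision can be illustrated with the sketch below. CONF_HIGH and CONF_MARGIN match the constants in mapping.py, but the scoring is a deliberate simplification: it compares only the last path segment against the file stem using difflib, whereas the real algorithm also considers symbols and tier1 summaries. The names name_score and classify, the result shape, and the handling of exact boundary values are assumptions. It needs Python 3.9 or later for str.removesuffix.

```python
from difflib import SequenceMatcher

CONF_HIGH = 0.75
CONF_MARGIN = 0.15  # top two within this margin means ambiguous


def name_score(go_module: str, py_path: str) -> float:
    """Similarity between the Go module's leaf name and the Python file stem."""
    leaf = go_module.rsplit("/", 1)[-1].lower()
    stem = py_path.rsplit("/", 1)[-1].removesuffix(".py").lower()
    return SequenceMatcher(None, leaf, stem).ratio()


def classify(go_module: str, py_paths: list[str]) -> dict:
    """Rank candidates and decide between 'confident' and 'ambiguous'."""
    ranked = sorted(
        ((name_score(go_module, p), p) for p in py_paths), reverse=True
    )
    top = ranked[0][0] if ranked else 0.0
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    confident = top > CONF_HIGH and (top - runner_up) > CONF_MARGIN
    return {
        "status": "confident" if confident else "ambiguous",
        "file": ranked[0][1] if confident else None,
        # Always expose the best few candidates so a caller can confirm.
        "candidates": [(round(score, 2), path) for score, path in ranked[:3]],
    }
```

Requiring both a high score and a clear lead means two equally good matches, such as identically named files in different directories, are reported as ambiguous rather than resolved arbitrarily.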

¹ scripts/source_index/scanner.py L1-120 (showing 40 of 120)

"""Phase A - deterministic Python source scanner (no model).

Walks the Python source tree, extracts per-file metadata + symbols + imports
via the stdlib `ast` module. Exact, free, no GPU. Populates source_files with
everything except the tier1_* columns (left summary_state='pending').
"""
from __future__ import annotations

import ast
import hashlib
import json
import sqlite3
import time
from pathlib import Path

from autoswe_cli.runctx import RunContext

SOURCE_ROOT = RunContext.active_or_legacy().source_root  # run-scoped Python source tree


def iter_py_files(root: Path):
    for p in sorted(root.rglob("*.py")):
        if "__pycache__" in p.parts:
            continue
        yield p


def _sha256(path: Path) -> str:
    h = hashlib.sha256()
    h.update(path.read_bytes())
    return h.hexdigest()


def extract_symbols(src: str) -> tuple[list[dict], list[str]]:
    """Return (symbols, imports) via ast. Symbols: classes (with bases +
    methods), top-level functions, dataclass detection, top-level constants."""
    symbols: list[dict] = []
    imports: list[str] = []
    try:
        tree = ast.parse(src)

² scripts/source_index/cli.py L1-120 (showing 40 of 120)

"""source-index CLI - the interface the planner/ACCEPT-writeback call.

Invoke via `python3 scripts/source-index.py` (the entry shim bootstraps
sys.path; no PYTHONPATH, no bash wrapper). Commands:
  lookup <go-module>     grounding for one module (the planner's call)
  scan [--full]          build-once: A deterministic + B/C summaries + D mapping
  scan-source [--full]   operator pre-step: scan + mapping audit + stats
  remap                  re-apply module->Python mapping after editing overrides
  audit                  mapping audit only, no rescan (parity-module coverage)
  target-reindex <mod>   deterministic Go symbol index after an ACCEPT
  freshness              content-hash staleness check
  stats                  coverage
"""
from __future__ import annotations

import argparse
import json
import sys

from . import db, scanner, summarize, modules, mapping, freshness, target


def _current_gen(con):
    return db.current_generation(con)


def cmd_scan(args) -> int:
    # Grounding model policy: the tier1/tier2 SUMMARIES (summarize/modules -> llm.py) use
    # the local model. A config run uses DETERMINISTIC-ONLY grounding (no model): it runs
    # the model-free AST scan + module mapping and SKIPS the summaries. This decouples a
    # config run from any model/endpoint entirely (the planner grounds on AST symbols +
    # go-doc). `--deterministic` forces it; AUTOSWE_RUN_CONFIG implies it.
    import os
    deterministic = bool(getattr(args, "deterministic", False)) or bool(os.environ.get("AUTOSWE_RUN_CONFIG"))
    db.init_db()
    con = db.connect()
    gen = _current_gen(con) + 1
    a = scanner.scan(con, scanner.SOURCE_ROOT, gen)
    if deterministic:
        b = c = {"skipped": "deterministic-only grounding (no model)"}

³ scripts/source_index/__init__.py L1-10

"""source_index - build-once metadata index for the autoswe rewrite.

Scans the Python sddcinfo source ONCE into a local sqlite index so the
planner/reviewer/writer get instant, complete grounding instead of wandering
the source tree every iteration. Self-contained on GB10: local Qwen3.6 (not
Anthropic), sqlite (not D1), no external-CLI dependency.

See the internal design notes for the design.
"""

⁴ scripts/source_index/mapping.py L1-120 (showing 40 of 120)

"""Phase D - module -> Python mapping (auditable, override-able).

Never asserts a confident-but-wrong answer (the exact failure this index
removes). Resolution order: manual overrides > deterministic rules >
fuzzy-with-confidence. Low-confidence/ambiguous results expose candidates so
the planner confirms instead of trusting one wrong file.
"""
from __future__ import annotations

import json
import re
import sqlite3
from pathlib import Path

from autoswe_cli.runctx import RunContext

# Run-scoped: the target bundle's matrix/module-map + the Python source tree.
_ctx = RunContext.active_or_legacy()
REPO = _ctx.autoswe_repo
PARITY = _ctx.parity
OVERRIDES = _ctx.module_map
SOURCE_ROOT = _ctx.source_root

CONF_HIGH = 0.75
CONF_MARGIN = 0.15  # if top two are within this, call it ambiguous


def _yaml_load(p: Path):
    import yaml
    return yaml.safe_load(p.read_text()) if p.exists() else None


def _parity_modules() -> list[str]:
    data = _yaml_load(PARITY) or {}
    return [m["id"] for m in data.get("modules", []) if "id" in m]


def _overrides() -> dict:
    data = _yaml_load(OVERRIDES) or {}
    return data.get("modules", {}) if isinstance(data, dict) else {}