Source Indexing

The Source Indexing subsystem transforms the Python source tree into a structured, queryable SQLite database to provide deterministic grounding for the autoswe planner. The pipeline operates in four distinct phases: a deterministic AST scan extracts file metadata and symbols; a local LLM generates tiered summaries for code context; a synthesis step aggregates these summaries into module-level overviews; and a final mapping phase resolves Go modules to their Python counterparts using a strict hierarchy of overrides, rules, and fuzzy matching. This architecture ensures that the planner receives instant, complete context without needing to traverse the file system or rely on external API calls during the rewrite loop.

The initial phase, handled by scripts/source_index/scanner.py, performs a content-hash-aware walk of the Python source tree to extract structural metadata without invoking any models 1. The scanner identifies all .py files, skipping __pycache__ directories, and computes a SHA256 hash for each file to detect changes. Using Python’s standard ast module, it parses the source code to extract classes (including bases, decorators, and methods), functions, and top-level constants.
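The extraction step described above can be sketched with the standard library alone. This is a minimal illustration, not the contents of scanner.py: the function name scan_source, the shape of the returned dictionary, and the rule that an all-uppercase top-level name counts as a constant are all assumptions made for the example.

```python
import ast
import hashlib


def scan_source(source: str) -> dict:
    """Hash the text and list its top-level classes, functions, and constants."""
    tree = ast.parse(source)
    symbols = {"classes": [], "functions": [], "constants": []}
    for node in tree.body:  # top-level statements only
        if isinstance(node, ast.ClassDef):
            symbols["classes"].append({
                "name": node.name,
                "bases": [ast.unparse(b) for b in node.bases],
                "decorators": [ast.unparse(d) for d in node.decorator_list],
                "methods": [
                    n.name for n in node.body
                    if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
                ],
            })
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols["functions"].append(node.name)
        elif isinstance(node, ast.Assign):
            # Assumption: an ALL_CAPS assignment target is treated as a constant.
            for target in node.targets:
                if isinstance(target, ast.Name) and target.id.isupper():
                    symbols["constants"].append(target.id)
    return {
        "sha256": hashlib.sha256(source.encode("utf-8")).hexdigest(),
        "line_count": len(source.splitlines()),
        "symbols": symbols,
    }


SAMPLE = '''
MAX_RETRIES = 3

class Tool(Base):
    @staticmethod
    def run():
        pass

def main():
    pass
'''

result = scan_source(SAMPLE)
print(result["symbols"])
```

Because the hash is computed from the file content rather than from timestamps, re-running the scan on an unchanged tree yields identical digests, which is what makes the change detection in the next step possible.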

The scanner maintains a source_files table in the SQLite database, storing file paths, sizes, line counts, and JSON-serialized symbol lists. It employs a generation-based update strategy: if a file’s content hash and summary state remain unchanged from the previous generation, the scanner preserves existing tier1 results to avoid redundant processing. Files with changed content or no prior summary are marked with summary_state='pending'. This phase is entirely deterministic and model-free, ensuring stable inputs for subsequent steps.
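The generation-based update can be illustrated with an in-memory SQLite database. The sketch below uses a trimmed-down, hypothetical schema (the real source_files table also stores sizes, line counts, and symbol lists), and it assumes that "unchanged" means the stored hash matches and the stored state is 'ok'. The function name record_scan is likewise invented for the example.

```python
import sqlite3

# Hypothetical, reduced schema for illustration only.
SCHEMA = """
CREATE TABLE source_files (
    path TEXT PRIMARY KEY,
    sha256 TEXT NOT NULL,
    generation INTEGER NOT NULL,
    tier1 TEXT,
    summary_state TEXT NOT NULL
)
"""


def record_scan(conn, path, sha256, generation):
    """Keep the tier1 summary for unchanged files; mark everything else pending."""
    row = conn.execute(
        "SELECT sha256, summary_state FROM source_files WHERE path = ?", (path,)
    ).fetchone()
    if row is not None and row[0] == sha256 and row[1] == "ok":
        # Same content and already summarized: only advance the generation.
        conn.execute(
            "UPDATE source_files SET generation = ? WHERE path = ?",
            (generation, path),
        )
        return "kept"
    # New or changed file: discard any stale summary and queue it again.
    conn.execute(
        "INSERT OR REPLACE INTO source_files "
        "(path, sha256, generation, tier1, summary_state) "
        "VALUES (?, ?, ?, NULL, 'pending')",
        (path, sha256, generation),
    )
    return "pending"


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
first = record_scan(conn, "tools/git.py", "abc", 1)   # new file
conn.execute("UPDATE source_files SET tier1 = 'Git helpers', summary_state = 'ok'")
second = record_scan(conn, "tools/git.py", "abc", 2)  # unchanged content
third = record_scan(conn, "tools/git.py", "def", 3)   # edited content
print(first, second, third)
```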

Once the AST scan populates the database with pending files, the system invokes a local LLM to generate contextual summaries. This process is managed by scripts/source_index/summarize.py and is triggered by the scan command unless the --deterministic flag is set 2. The --deterministic mode, also implied by the AUTOSWE_RUN_CONFIG environment variable, skips LLM inference entirely, relying solely on AST symbols and Go documentation for grounding.
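As a rough sketch of how the two switches might combine, the snippet below gates LLM summarization on both the flag and the environment variable. The text does not say how AUTOSWE_RUN_CONFIG implies deterministic mode, so treating any non-empty value as "deterministic" is an assumption, as are the parser layout and the llm_enabled helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; only the --deterministic flag comes from the text.
    parser = argparse.ArgumentParser(prog="scan")
    parser.add_argument(
        "--deterministic",
        action="store_true",
        help="skip LLM summarization and rely on AST symbols only",
    )
    return parser


def llm_enabled(argv, environ) -> bool:
    """Return True when tier1/tier2 summarization should run after the scan."""
    args = build_parser().parse_args(argv)
    # Assumption: a non-empty AUTOSWE_RUN_CONFIG implies deterministic mode.
    implied = bool(environ.get("AUTOSWE_RUN_CONFIG"))
    return not (args.deterministic or implied)


print(llm_enabled([], {}))
print(llm_enabled(["--deterministic"], {}))
print(llm_enabled([], {"AUTOSWE_RUN_CONFIG": "run.yaml"}))
```

Passing the argument list and environment explicitly, rather than reading sys.argv and os.environ inside the function, keeps the decision easy to test.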

When LLM inference is active, the system generates "tier1" summaries for individual source files and "tier2" summaries for module directories. These summaries are stored in the source_files and source_modules tables respectively, with their summary_state updated to 'ok' upon successful generation. The use of a local model (Qwen3.6) ensures that this step does not depend on external API endpoints, maintaining privacy and reducing latency 3.

The synthesis phase, implemented in scripts/source_index/modules.py, aggregates file-level information into module-level overviews 2. This step produces “tier2” summaries, which provide a higher-level context for directories containing Python files. The synthesis process queries the database for files with summary_state='ok' and combines their tier1 summaries to form a cohesive module description.
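The grouping half of this step can be sketched without a model. The example below collects tier1 summaries by parent directory, skipping files that are not yet 'ok'. It only prepares the aggregated text; how modules.py actually turns that text into a tier2 summary is not shown in the source, and the function name and row layout here are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_tier1_by_module(rows):
    """Map each directory to the joined tier1 summaries of its 'ok' files.

    rows: iterable of (path, tier1, summary_state) tuples.
    """
    modules = defaultdict(list)
    for path, tier1, state in sorted(rows, key=lambda r: r[0]):
        if state != "ok":
            continue  # pending files have no usable summary yet
        p = PurePosixPath(path)
        modules[str(p.parent)].append(f"{p.name}: {tier1}")
    return {module: "\n".join(lines) for module, lines in modules.items()}


rows = [
    ("tools/a.py", "Wraps git commands.", "ok"),
    ("tools/b.py", None, "pending"),
    ("commands/c.py", "Entry point for scan.", "ok"),
    ("tools/d.py", "Parses diffs.", "ok"),
]
grouped = group_tier1_by_module(rows)
print(grouped)
```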

This aggregated context is crucial for the mapping phase, as it provides the planner with a broader understanding of module responsibilities and interactions. The synthesized data is stored in the source_modules table, linked to the current generation. By pre-computing these summaries, the system avoids the need for the planner to infer module structure from raw code during planning iterations.

The final phase, handled by scripts/source_index/mapping.py, resolves each Go module to its corresponding Python source files 4. This mapping is critical for the planner to understand which Python code is affected by Go changes. The resolution follows a strict priority order: manual overrides, deterministic rules, and fuzzy matching.

Manual overrides are loaded from acceptance/module-map.yaml and take precedence over all other methods. If no override exists, the system applies deterministic rules, such as mapping internal/adapters/<name> to tools/<name>.py or internal/app/<name> to commands/<name>.py. These rules have high confidence scores (0.9-0.95) and are source-labeled as “rule”.
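The first two levels of the priority order can be sketched as follows. Overrides are passed in as a plain dictionary to keep the example dependency-free (the real values come from acceptance/module-map.yaml). The text gives a confidence range of 0.9-0.95 but does not say which rule gets which score, so the per-rule values below are placeholders, as are the "override" label, its confidence of 1.0, and the resolve function name.

```python
import re

# (pattern, target template, confidence). Confidence values are assumed.
RULES = [
    (re.compile(r"^internal/adapters/(?P<name>[^/]+)$"), "tools/{name}.py", 0.95),
    (re.compile(r"^internal/app/(?P<name>[^/]+)$"), "commands/{name}.py", 0.90),
]


def resolve(go_module, overrides):
    """Try manual overrides first, then deterministic rules; None if neither hits."""
    if go_module in overrides:
        return {
            "files": list(overrides[go_module]),
            "source": "override",   # assumed label
            "confidence": 1.0,       # assumed score
        }
    for pattern, template, confidence in RULES:
        match = pattern.match(go_module)
        if match:
            return {
                "files": [template.format(name=match.group("name"))],
                "source": "rule",
                "confidence": confidence,
            }
    return None  # caller would fall through to fuzzy matching


overrides = {"internal/adapters/git": ["vcs/git_tool.py"]}
print(resolve("internal/adapters/git", overrides))
print(resolve("internal/adapters/shell", overrides))
print(resolve("internal/app/plan", overrides))
print(resolve("internal/other/thing", overrides))
```

Returning None for unmatched modules keeps the fallback to fuzzy matching explicit at the call site.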

For modules not covered by rules, the system employs a fuzzy matching algorithm. This algorithm scores source files based on name similarity in paths, symbols, and tier1 summaries. If the top candidate’s score exceeds a confidence threshold (0.75) and is sufficiently higher than the second candidate (margin of 0.15), the mapping is marked as confident. Otherwise, the module is marked as “ambiguous,” exposing candidates to the planner for manual confirmation. This approach prevents incorrect mappings from propagating into the planning phase.
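The confidence decision can be illustrated separately from the scoring itself. The sketch below takes precomputed scores and applies the 0.75 threshold and 0.15 margin from the text. How mapping.py computes those scores is not shown here, and the function name, the return shape, and the treatment of a lone candidate (second score taken as 0.0) are assumptions.

```python
def classify(candidates, threshold=0.75, margin=0.15):
    """Decide between a confident mapping and an ambiguous one.

    candidates: dict of source file path -> similarity score in [0, 1].
    """
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return {"status": "ambiguous", "candidates": []}
    top_path, top_score = ranked[0]
    # Assumption: with a single candidate, the runner-up score is 0.0.
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_score > threshold and top_score - second_score >= margin:
        return {"status": "confident", "file": top_path, "confidence": top_score}
    # Expose every candidate, best first, for manual confirmation.
    return {"status": "ambiguous", "candidates": [path for path, _ in ranked]}


clear_winner = classify({"tools/git.py": 0.9, "tools/github.py": 0.5})
too_close = classify({"tools/git.py": 0.8, "tools/github.py": 0.75})
too_weak = classify({"tools/git.py": 0.6})
print(clear_winner, too_close, too_weak, sep="\n")
```

The margin check is what separates this from a plain threshold: a high score alone is not enough when a second file scores almost as well.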
