Deployment & Infrastructure

The autoswe deployment relies on a combination of systemd units and Kubernetes configurations to manage the autonomous loop, site generation, and system health monitoring. The core architecture separates concerns into distinct services: a static HTTP server for the published site, a synthesis pipeline for regenerating content, a meta-judge for external review, a system-state monitor for health metrics, and a watchdog for liveness checks. The HTTP server runs continuously, while the other services are oneshot units started by systemd timers and, for synthesis, by a path trigger as well, so the system keeps operating unattended and responds quickly to state changes.

Static Site Server

The autoswe-journey-server.service runs a simple Python HTTP server (scripts/journey_server.py) that serves the published static site in “publish mode” ¹. It binds to 0.0.0.0 on port 8765 and, with the --public flag, serves only the leak-scanned static dist/ directory and snapshots. In this mode the raw debug surfaces (/api/journey, /raw, and /status) return 404. Nothing breaks as a result, because the frontend no longer fetches from /api. The service restarts on failure after a 10-second delay and logs to the journal.
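The publish-mode filtering described above can be sketched as a small path check. This is a minimal illustration, not the actual scripts/journey_server.py: the function name public_status and the BLOCKED tuple are hypothetical, and treating everything under /api as blocked is an assumption drawn from the debug surfaces named in the unit file.

```python
from http import HTTPStatus

# Hypothetical block list, based on the debug surfaces named in the unit file.
# Blocking the whole /api subtree (not just /api/journey) is an assumption.
BLOCKED = ("/api", "/raw", "/status")


def public_status(path: str) -> HTTPStatus:
    """Return the status a publish-mode server might give for a request path."""
    clean = path.split("?", 1)[0]  # ignore any query string
    for base in BLOCKED:
        # Match the surface itself or anything beneath it, but not
        # unrelated paths that merely share a prefix (e.g. /statusboard).
        if clean == base or clean.startswith(base + "/"):
            return HTTPStatus.NOT_FOUND
    return HTTPStatus.OK
```

Matching on a path boundary rather than a bare string prefix keeps static files such as /statusboard.html servable while still hiding /status and everything under it.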

Site Synthesis Pipeline

Site regeneration is handled by autoswe-journey-synth.service, a oneshot service that regenerates site-data.json and per-iteration content and rebuilds the published site ². It runs scripts/synthesize-journey.sh with the synthesize subcommand and the flags --public, --no-codex-cleanup, and --rebuild-site. The --public flag enforces strict, fail-closed redaction. The --no-codex-cleanup flag skips the AI-speak strip, which structured data does not need, and keeps the local model out of the regeneration path. The --rebuild-site step invokes npm/node, which is why the unit's PATH includes the mise shims. The service runs with a nice value of 10 and sets JOURNEY_PUBLISH=1.

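This synthesis is triggered periodically by the autoswe-journey-synth.timer, which first fires 2 minutes after boot (OnBootSec=2min) and then 15 minutes after each activation of the service (OnUnitActiveSec=15min) ³. AccuracySec=30s allows systemd to delay each firing by up to 30 seconds.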

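Additionally, a file-system trigger exists via autoswe-synth-bridge.path and autoswe-synth-bridge.service ⁴ ⁵. The path unit watches /data/autoswe/synth-trigger with PathExists= and starts the bridge service when the file appears; the in-pod journey server's POST /synthesize endpoint writes the file via atomic rename ⁴. The bridge service first removes the trigger file in ExecStartPre, so that re-triggers arriving during a run are coalesced into a single follow-up run, then runs the same synthesis pipeline as the timer ⁵. The trigger is removed even if synthesis fails, so the operator is not stuck retriggering the same broken state. The service has a start timeout of 600 seconds (TimeoutStartSec=600) and runs with a nice value of 10.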
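The producer side of this trigger is not shown in the cited sources. A minimal sketch of writing a trigger file via atomic rename, assuming a hypothetical helper named write_trigger and a temp file created in the same directory (so the rename stays on one filesystem), might look like this:

```python
import os
import tempfile
from pathlib import Path


def write_trigger(trigger: Path, payload: str = "") -> None:
    """Create `trigger` atomically so a watcher never sees a partial file.

    Hypothetical helper: the real POST /synthesize handler is not shown here.
    """
    trigger.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives next to the target so os.replace is a same-filesystem
    # rename, which is what makes the final step atomic.
    fd, tmp_name = tempfile.mkstemp(dir=trigger.parent, prefix=".synth-trigger.")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(payload)
        os.replace(tmp_name, trigger)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

In production the target would be /data/autoswe/synth-trigger; writing it again while a synthesis run is in progress simply recreates the file, and the path unit fires once more after the run finishes.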

Meta-Judge

The autoswe-meta-judge.service runs an external Codex review of the loop behavior ⁶. It executes deploy/codex_meta_judge.py as a oneshot service. The autoswe-meta-judge.timer fires this service every 30 minutes after a 5-minute boot delay, with an accuracy of 1 minute ⁷.

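deploy/codex-rubric.py is the wrapper that the per-iteration test-judge must use to invoke the Codex CLI non-interactively, instead of calling codex exec directly ⁸. Its flags mirror the working deploy/codex-meta-judge.sh invocation used for the meta-judge. The wrapper passes the prompt as an argument and closes stdin, which prevents codex from blocking on “Reading additional input from stdin” until the orchestrator's 10-minute judge wait expires. It uses the gpt-5.5 model with high reasoning effort and has a timeout of 420 seconds, comfortably under that 600-second wait. The script extracts the last “codex” section from the output, falling back to the whole output if the markers are absent.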
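The closed-stdin technique can be sketched with the standard subprocess module. This is an illustration under assumptions, not the elided part of deploy/codex-rubric.py: run_closed_stdin is a hypothetical name, and the command is left generic because the actual codex flags are not shown in the excerpt.

```python
from __future__ import annotations

import subprocess
import sys

TIMEOUT_S = 420  # same budget as the wrapper's documented timeout


def run_closed_stdin(cmd: list[str], timeout_s: float = TIMEOUT_S) -> str:
    """Run `cmd` with stdin attached to the null device and return its output.

    A child that tries to read stdin sees EOF immediately instead of waiting
    for input. subprocess.TimeoutExpired propagates if the budget is exceeded.
    """
    proc = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,   # the key step: nothing to wait on
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # keep diagnostics alongside the answer
        text=True,
        timeout=timeout_s,
        check=False,
    )
    return proc.stdout


if __name__ == "__main__":
    # Stand-in child process that reads all of stdin; it returns at once.
    child = [sys.executable, "-c", "import sys; print(repr(sys.stdin.read()))"]
    print(run_closed_stdin(child, timeout_s=30))
```

The stand-in child prints an empty string because its read hits EOF straight away, which is the behavior the wrapper relies on to avoid the stdin wait.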

System State Monitor

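The autoswe-monitor.service captures time-series health and performance data ⁹. It executes deploy/system_state.py with the --log flag, which appends a flat record (TPS, GPU power and clocks, contention, iter/vertex, parity, CRIT/WARN findings) to system-state.jsonl on every run ¹⁰. The unit loads AUTOSWE_VLLM_URL and any other AUTOSWE_* settings from ~/.config/autoswe/cluster.env, a file refreshed on every deploy by sddc k3s deploy --stack autosre. The file is optional: if it is absent, the script falls back to http://127.0.0.1:8010 for the vLLM URL ⁹ ¹⁰. The service runs with a nice value of 15 ⁹.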
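The environment fallback and the append-only log can be sketched as follows. The helper names vllm_url and append_record are hypothetical, and the record fields are left to the caller because the real schema of system-state.jsonl is not shown in the excerpt.

```python
import json
import os
from pathlib import Path
from typing import Mapping

DEFAULT_VLLM_URL = "http://127.0.0.1:8010"  # fallback named in the unit file


def vllm_url(env: Mapping[str, str] = os.environ) -> str:
    """Prefer AUTOSWE_VLLM_URL; fall back to the default when unset or empty."""
    return env.get("AUTOSWE_VLLM_URL") or DEFAULT_VLLM_URL


def append_record(log_path: Path, record: dict) -> None:
    """Append one flat JSON object per line (JSON Lines).

    Hypothetical helper; the real field set is not shown in the excerpt.
    """
    line = json.dumps(record, sort_keys=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

One object per line keeps the file appendable from a short-lived oneshot process and lets later tooling read the series line by line without parsing the whole file.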

The autoswe-monitor.timer fires the monitor every 3 minutes after a 1-minute boot delay, with an accuracy of 15 seconds ¹¹.

The system_state.py script performs several checks:

Loop: Checks state.json and state-heartbeat.json for iteration status and heartbeat age ¹⁰. A heartbeat older than 900 seconds is CRIT, and older than 600 seconds is WARN.
Subagents: Checks for lingering subagents exceeding their phase budgets (planner ~10m, writer ~15m, judge ~12m).
vLLM: Queries /metrics for running/waiting requests and calculates live tokens per second (TPS). A stall is detected if requests are running but TPS is 0 for 6 seconds.
GPU: Uses nvidia-smi to check utilization, clocks, power, temperature, and throttling.
Parity: Parses acceptance/parity-matrix.yaml to count green, red, manual, and n/a states.
Verdicts: Reads the last 5 lines of verdicts.jsonl.
Site: Checks the age of journey-web/dist/index.html and syncs with site-data.json.
Codex Review: Checks the age and health score of the latest meta-judge verdict.
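Two of the checks above reduce to simple threshold logic. The sketch below uses the heartbeat thresholds stated in the list (600 and 900 seconds); the function names and the two-sample token-counter approach to TPS are assumptions, since the body of system_state.py is not shown in the excerpt.

```python
def heartbeat_severity(age_s: float) -> str:
    """Classify heartbeat age using the thresholds quoted above."""
    if age_s > 900:
        return "CRIT"
    if age_s > 600:
        return "WARN"
    return "OK"


def live_tps(tokens_before: float, tokens_after: float, interval_s: float) -> float:
    """Tokens per second from two samples of a monotonically growing counter.

    Assumption: TPS is derived from a generated-token counter sampled twice.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return max(0.0, tokens_after - tokens_before) / interval_s


def is_stalled(running_requests: int, tps: float) -> bool:
    """A stall: requests are in flight but no tokens were produced."""
    return running_requests > 0 and tps == 0
```

Checking the stricter threshold first ensures an age above 900 seconds is reported as CRIT rather than being caught by the WARN branch.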

Liveness Watchdog

The autoswe-watchdog.service is a oneshot service that runs deploy/liveness-watchdog.py ¹². It is designed to restart the main autoswe.service if the state.json heartbeat is stale. The autoswe-watchdog.timer fires this service every 5 minutes after a 2-minute boot delay, with an accuracy of 30 seconds ¹³.
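The watchdog's decision can be sketched as a staleness check that yields a restart command. Everything specific here is an assumption: deploy/liveness-watchdog.py is not shown, so the 900-second threshold, the use of the file's modification time as the heartbeat, and the systemctl --user invocation are illustrative choices only.

```python
from __future__ import annotations

import time
from pathlib import Path

STALE_AFTER_S = 900  # hypothetical threshold; the real value is not shown


def heartbeat_age(state_path: Path, now: float | None = None) -> float:
    """Seconds since state.json was last modified; infinite if it is missing."""
    now = time.time() if now is None else now
    try:
        return now - state_path.stat().st_mtime
    except FileNotFoundError:
        return float("inf")


def restart_command(
    state_path: Path,
    stale_after_s: float = STALE_AFTER_S,
    now: float | None = None,
) -> list[str] | None:
    """Return the command to run when the heartbeat is stale, else None.

    Assumes autoswe.service is a user unit, hence `systemctl --user`.
    """
    if heartbeat_age(state_path, now) > stale_after_s:
        return ["systemctl", "--user", "restart", "autoswe.service"]
    return None
```

Returning the command instead of executing it keeps the decision easy to test; a caller would pass a non-None result to subprocess.run. Treating a missing state file as infinitely stale means a loop that never wrote a heartbeat is also restarted.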

deploy/autoswe-journey-server.service L1-21

[Unit]
Description=autoswe journey HTTP server - serves the published static site (publish mode)
Documentation=https://github.com/sddcinfo/autoswe

[Service]
Type=simple
WorkingDirectory=@AUTOSWE_REPO_ROOT@
Environment="PATH=%h/.local/share/mise/shims:%h/.local/bin:/usr/local/bin:/usr/bin:/bin"
# --public: serve ONLY the leak-scanned static dist/ (+ snapshots); the raw
# debug surfaces (/api/journey, /raw, /status) are 404. The frontend no longer
# fetches /api, so nothing breaks.
ExecStart=/usr/bin/python3 @AUTOSWE_REPO_ROOT@/scripts/journey_server.py --bind 0.0.0.0 --port 8765 --public
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-journey-server

[Install]
WantedBy=default.target

deploy/autoswe-journey-synth.service L1-18

[Unit]
Description=autoswe journey synthesizer - regenerates site-data.json + per-iter content and rebuilds the published site (leak-gated)
Documentation=https://github.com/sddcinfo/autoswe

[Service]
Type=oneshot
# PATH includes mise shims so --rebuild-site can invoke npm/node. --public
# (strict, fail-closed redaction) + --no-codex-cleanup (structured data needs no
# AI-speak strip and it keeps the local model out of the regen path).
Environment="PATH=%h/.local/share/mise/shims:%h/.local/bin:/usr/local/bin:/usr/bin:/bin"
Environment="HOME=%h"
Environment="JOURNEY_PUBLISH=1"
ExecStart=/bin/bash @AUTOSWE_REPO_ROOT@/scripts/synthesize-journey.sh synthesize --public --no-codex-cleanup --rebuild-site
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-journey-synth
Nice=10

deploy/autoswe-journey-synth.timer L1-13

[Unit]
Description=autoswe journey synthesizer - fires every 15 min to refresh aggregate docs + 8765 site
Documentation=https://github.com/sddcinfo/autoswe

[Timer]
OnBootSec=2min
OnUnitActiveSec=15min
Unit=autoswe-journey-synth.service
AccuracySec=30s

[Install]
WantedBy=timers.target

deploy/autoswe-synth-bridge.path L1-17

[Unit]
Description=Watch /data/autoswe/synth-trigger and fire autoswe-synth-bridge.service
Documentation=https://github.com/sddcinfo/autoswe

[Path]
# When the in-pod journey-server's POST /synthesize endpoint writes this
# file (via atomic rename), systemd starts autoswe-synth-bridge.service.
# PathExists fires on creation; PathExistsGlob alone isn't enough since we
# need to retrigger if the same path reappears after the service removed it.
PathExists=/data/autoswe/synth-trigger
Unit=autoswe-synth-bridge.service
MakeDirectory=true
DirectoryMode=0755

[Install]
WantedBy=default.target

deploy/autoswe-synth-bridge.service L1-26

[Unit]
Description=autoswe synth bridge - consume /data/autoswe/synth-trigger and run synthesize-journey.sh
Documentation=https://github.com/sddcinfo/autoswe
# Triggered by autoswe-synth-bridge.path on PathExists=. Reads the trigger
# file, removes it, then runs the full host-side synthesis pipeline (pnpm,
# codex, etc.) where credentials live.
After=network.target

[Service]
Type=oneshot
Environment="PATH=%h/.local/share/mise/shims:%h/.local/bin:/usr/local/bin:/usr/bin:/bin"
Environment="HOME=%h"
Environment="JOURNEY_PUBLISH=1"
WorkingDirectory=@AUTOSWE_REPO_ROOT@
# Two steps, atomic-ish: remove the trigger first (so re-triggers during the
# run get coalesced - only one extra run will fire after we finish), then
# execute synth. The trigger file is removed even if synth fails so the
# operator doesn't get stuck retriggering the same broken state.
ExecStartPre=/bin/rm -f /data/autoswe/synth-trigger
ExecStart=/bin/bash @AUTOSWE_REPO_ROOT@/scripts/synthesize-journey.sh synthesize --public --no-codex-cleanup --rebuild-site
TimeoutStartSec=600
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-synth-bridge
Nice=10

deploy/autoswe-meta-judge.service L1-11

[Unit]
Description=autoswe codex meta-judge - external Codex review of loop behavior
Documentation=https://github.com/sddcinfo/autoswe

[Service]
Type=oneshot
ExecStart=/usr/bin/env python3 @AUTOSWE_REPO_ROOT@/deploy/codex_meta_judge.py
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-meta-judge

deploy/autoswe-meta-judge.timer L1-13

[Unit]
Description=autoswe codex meta-judge - fires every 30 min to grade the system
Documentation=https://github.com/sddcinfo/autoswe

[Timer]
OnBootSec=5min
OnUnitActiveSec=30min
Unit=autoswe-meta-judge.service
AccuracySec=1min

[Install]
WantedBy=timers.target

deploy/codex-rubric.py L1-71 (showing 40 of 71)

#!/usr/bin/env python3
"""Run one Codex rubric prompt non-interactively and print the codex response.

The per-iter test-judge MUST call this instead of invoking `codex exec`
directly. Running codex without a closed stdin makes it block on "Reading
additional input from stdin" until the orchestrator's 10-minute judge wait
expires (the judge_timeout REVERT seen on 2026-05-22). This passes the prompt
as an argument and closes stdin, so codex returns immediately. Flags mirror
the working deploy/codex-meta-judge.sh invocation.

Usage: python3 deploy/codex-rubric.py <prompt-file>   # prints codex output
"""
from __future__ import annotations

import pathlib
import subprocess
import sys

TIMEOUT_S = 420  # comfortably under the orchestrator's 600s judge wait


def _extract_codex_section(raw: str) -> str:
    """codex exec prints sections delimited by bare 'codex' / 'tokens used'
    marker lines. Return the LAST 'codex' section; fall back to the whole
    output if the markers are absent."""
    out: list[str] = []
    in_codex = False
    for ln in raw.splitlines():
        s = ln.strip()
        if s == "codex":
            in_codex, out = True, []
            continue
        if s == "tokens used":
            if in_codex:
                return "\n".join(out).strip()
            in_codex = False
            continue
        if in_codex:
            out.append(ln)
    return "\n".join(out).strip() if out else raw.strip()

deploy/autoswe-monitor.service L1-22

[Unit]
Description=autoswe system-state monitor - time-series health + performance capture (no blind spots)
Documentation=https://github.com/sddcinfo/autoswe

[Service]
Type=oneshot
Environment="PATH=%h/.local/share/mise/shims:%h/.local/bin:/usr/local/bin:/usr/bin:/bin"
Environment="HOME=%h"
# Source AUTOSWE_VLLM_URL (and any other AUTOSWE_* knobs) from the file that
# `sddc k3s deploy --stack autosre` refreshes on every deploy. The leading
# dash makes a missing file non-fatal: the script falls back to the
# default http://127.0.0.1:8010 if the env file is absent, which is the
# right behavior for pre-migration installs.
EnvironmentFile=-%h/.config/autoswe/cluster.env
# --log appends a flat record (tps, GPU power/clocks, contention, iter/vertex,
# parity, CRIT/WARN) to system-state.jsonl every run -> a queryable time series.
ExecStart=/usr/bin/env python3 @AUTOSWE_REPO_ROOT@/deploy/system_state.py --log
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-monitor
Nice=15

deploy/system_state.py L1-120 (showing 40 of 120)

#!/usr/bin/env python3
"""autoswe system-state reporter - the single reliable view, no blind spots.

Captures the WHOLE live state of the autoswe loop + its substrate in one shot,
each check defensive (a failure in one never blanks the rest), and flags
anomalies (OK / WARN / CRIT) so problems surface instead of hiding. Built after
a run of blind ad-hoc ssh probes kept missing things (unbounded-wait hang,
0-token vLLM stall, lingering subagents, 2-day-stale site).

Run ON GB10:
  python3 deploy/system_state.py            # human-readable report
  python3 deploy/system_state.py --json     # machine-readable
  python3 deploy/system_state.py --probe-tps  # add a controlled tps benchmark

Exit code: 0 if no CRIT, 1 if any CRIT finding.
"""
from __future__ import annotations

import argparse
import json
import os
import re
import subprocess
import time
import urllib.request
from pathlib import Path

HOME = Path.home()
RUNTIME = HOME / ".autoswe" / "rebuild" / "runtime"
# Repo root with NO monorepo assumption: AUTOSWE_REPO_ROOT wins, else derive from
# this file's location (deploy/system_state.py → parents[1] is the repo root).
# This script runs as a bare `python3 deploy/system_state.py` with no package on
# sys.path, so it cannot import autoswe_cli.paths and must resolve independently.
REPO = Path(os.environ.get("AUTOSWE_REPO_ROOT", str(Path(__file__).resolve().parents[1]))).expanduser()
SDDC = HOME / "sddc"
# vLLM URL is overridable via the AUTOSWE_VLLM_URL env var so the monitor
# follows the autosre vLLM ClusterIP after the k3s migration. The host
# autoswe-monitor.service unit picks the value up from
# ~/.config/autoswe/cluster.env, which `sddc k3s deploy --stack autosre`
# refreshes on every deploy.

deploy/autoswe-monitor.timer L1-13

[Unit]
Description=autoswe system-state monitor - fires every 3 min to capture health + performance
Documentation=https://github.com/sddcinfo/autoswe

[Timer]
OnBootSec=1min
OnUnitActiveSec=3min
Unit=autoswe-monitor.service
AccuracySec=15s

[Install]
WantedBy=timers.target

deploy/autoswe-watchdog.service L1-11

[Unit]
Description=autoswe liveness watchdog - restart autoswe.service if state.json heartbeat is stale
Documentation=https://github.com/sddcinfo/autoswe

[Service]
Type=oneshot
ExecStart=@AUTOSWE_REPO_ROOT@/deploy/liveness-watchdog.py
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-watchdog

deploy/autoswe-watchdog.timer L1-13

[Unit]
Description=autoswe liveness watchdog - fires every 5 min to catch silent stalls
Documentation=https://github.com/sddcinfo/autoswe

[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Unit=autoswe-watchdog.service
AccuracySec=30s

[Install]
WantedBy=timers.target