Deployment & Infrastructure
The autoswe deployment relies on a combination of systemd units and Kubernetes configurations to manage the autonomous loop, site generation, and system health monitoring. The core architecture separates concerns into distinct services: a static HTTP server for the published site, a synthesis pipeline for regenerating content, a meta-judge for external review, a system-state monitor for health metrics, and a watchdog for liveness checks. These components are orchestrated via systemd timers and path triggers to ensure continuous operation and rapid response to state changes.
Static Site Server
The autoswe-journey-server.service runs a Python HTTP server (scripts/journey_server.py) that serves the published static site in “publish mode” 1. It binds to 0.0.0.0 on port 8765 and, with the --public flag, serves only the leak-scanned static dist/ directory and snapshots. The raw debug surfaces /api/journey, /raw, and /status return 404. Nothing breaks as a result, because the frontend no longer fetches from /api. The service restarts on failure after a 10-second delay and logs to the journal.
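The publish-mode behavior described above can be sketched with the standard library. This is a minimal illustration, not the actual scripts/journey_server.py: the handler class, the is_blocked helper, and the prefix-matching rule are assumptions. Only the three blocked paths come from the unit file's comments.

```python
from __future__ import annotations

from http.server import SimpleHTTPRequestHandler
from urllib.parse import urlsplit

# Paths named in the unit-file comment. How the real server matches them
# (exact match, prefix, regex) is an assumption in this sketch.
BLOCKED_PREFIXES = ("/api/journey", "/raw", "/status")


def is_blocked(request_path: str) -> bool:
    """Return True when a request targets one of the raw debug surfaces."""
    path = urlsplit(request_path).path  # drop any query string or fragment
    return any(path == p or path.startswith(p + "/") for p in BLOCKED_PREFIXES)


class PublishModeHandler(SimpleHTTPRequestHandler):
    """Static-file handler that answers 404 for debug routes (hypothetical)."""

    def do_GET(self) -> None:
        if is_blocked(self.path):
            self.send_error(404, "Not Found")
            return
        super().do_GET()
```

To serve a directory with this handler, one could bind it with functools.partial(PublishModeHandler, directory="dist") and pass it to http.server.ThreadingHTTPServer(("0.0.0.0", 8765), ...). Matching on whole path segments keeps an unrelated file such as /rawdata reachable.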
Site Synthesis Pipeline
Site regeneration is handled by autoswe-journey-synth.service, a oneshot service that regenerates site-data.json and per-iteration content and rebuilds the published site 2. It runs scripts/synthesize-journey.sh synthesize with the flags --public, --no-codex-cleanup, and --rebuild-site. The --public flag enforces strict, fail-closed redaction. The --no-codex-cleanup flag skips the AI-speak cleanup pass, which structured data does not need, and thereby keeps the local model out of the regeneration path. The --rebuild-site flag invokes npm/node, which is why the unit's PATH includes the mise shims. The service runs with a nice value of 10 and sets JOURNEY_PUBLISH=1.
This synthesis is triggered periodically by the autoswe-journey-synth.timer, which fires every 15 minutes after an initial 2-minute delay on boot 3. The timer has an accuracy of 30 seconds.
Additionally, a file-system trigger mechanism exists via autoswe-synth-bridge.path and autoswe-synth-bridge.service 4 5. The path unit watches for the creation of /data/autoswe/synth-trigger 4. When this file appears, the bridge service is triggered. The bridge service first removes the trigger file to coalesce re-triggers during execution, then runs the same synthesis pipeline as the timer 5. It has a timeout of 600 seconds and runs with a nice value of 10.
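The path unit's comments state that the trigger file is written via atomic rename. A minimal sketch of that producer side follows, assuming a hypothetical write_trigger helper. The real endpoint code is not shown in this document, so the temp-file naming and the payload are illustrative only.

```python
from __future__ import annotations

import os
import tempfile
from pathlib import Path


def write_trigger(trigger: Path, payload: str = "") -> None:
    """Create `trigger` atomically: write a temp file, then rename it.

    The path unit only ever observes a fully written file, because
    os.replace() is atomic when source and target share a filesystem.
    """
    trigger.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename never
    # crosses a filesystem boundary.
    fd, tmp_name = tempfile.mkstemp(dir=trigger.parent, prefix=".synth-trigger-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            fh.write(payload)
        os.replace(tmp_name, trigger)
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Calling write_trigger(Path("/data/autoswe/synth-trigger")) would then satisfy PathExists= and start the bridge service. Because the bridge removes the file before running synthesis, a trigger written during a run causes one further run afterwards.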
Meta-Judge
The autoswe-meta-judge.service runs an external Codex review of the loop behavior 6. It executes deploy/codex_meta_judge.py as a oneshot service. The autoswe-meta-judge.timer fires this service every 30 minutes after a 5-minute boot delay, with an accuracy of 1 minute 7.
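A related wrapper, deploy/codex-rubric.py, invokes the Codex CLI non-interactively 8. According to its docstring, the per-iteration test-judge must call this wrapper rather than running codex exec directly, and its flags mirror the meta-judge invocation. The wrapper passes the prompt as an argument and closes stdin so that codex does not block on “Reading additional input from stdin”. It uses the gpt-5.5 model with high reasoning effort and a timeout of 420 seconds, which stays under the orchestrator's 600-second judge wait. The script extracts the last “codex” section from the output, falling back to the whole output if the markers are absent.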
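The closed-stdin technique can be shown in isolation. The sketch below is a generic helper, not the wrapper's actual code. run_noninteractive is a hypothetical name, and the command is supplied by the caller, so no Codex CLI flags are assumed.

```python
from __future__ import annotations

import subprocess
import sys


def run_noninteractive(cmd: list[str], timeout_s: float = 420.0) -> str:
    """Run `cmd` with stdin closed and a hard timeout; return its stdout."""
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # child reads EOF at once instead of waiting
        capture_output=True,
        text=True,
        timeout=timeout_s,  # raises subprocess.TimeoutExpired on overrun
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # A child that reads all of stdin returns immediately with no input.
    child = [sys.executable, "-c", "import sys; print(len(sys.stdin.read()))"]
    print(run_noninteractive(child, timeout_s=30).strip())
```

Without stdin=subprocess.DEVNULL, a child that reads stdin inherits the parent's stdin and can wait indefinitely when the parent is a service with an open but silent input stream.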
System State Monitor
The autoswe-monitor.service captures time-series health and performance data 9. It executes deploy/system_state.py with the --log flag, which appends a flat record to system-state.jsonl 10. The monitor reads environment variables from ~/.config/autoswe/cluster.env if present, falling back to http://127.0.0.1:8010 for the vLLM URL 9 10. The service runs with a nice value of 15 9.
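The two mechanisms in this paragraph, the environment fallback and the append-only JSONL log, might look roughly like the following. The function names and record fields are hypothetical. Only the AUTOSWE_VLLM_URL variable and the default URL come from the source comments.

```python
from __future__ import annotations

import json
import os
from pathlib import Path
from typing import Mapping

DEFAULT_VLLM_URL = "http://127.0.0.1:8010"


def vllm_url(env: Mapping[str, str] | None = None) -> str:
    """Resolve the vLLM base URL, falling back to the local default."""
    env = os.environ if env is None else env
    # An unset or empty variable both fall back to the default.
    return env.get("AUTOSWE_VLLM_URL") or DEFAULT_VLLM_URL


def append_record(log_path: Path, record: dict) -> None:
    """Append one flat JSON object as a single line (JSONL)."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

One record per line keeps the log easy to query: each run adds a line, and a reader can parse the file incrementally with json.loads on every line.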
The autoswe-monitor.timer fires the monitor every 3 minutes after a 1-minute boot delay, with an accuracy of 15 seconds 11.
The system_state.py script performs several checks:
- Loop: Checks state.json and state-heartbeat.json for iteration status and heartbeat age 10. A heartbeat older than 900 seconds is CRIT, and one older than 600 seconds is WARN.
- Subagents: Checks for lingering subagents exceeding their phase budgets (planner ~10m, writer ~15m, judge ~12m).
- vLLM: Queries /metrics for running/waiting requests and calculates live tokens per second (TPS). A stall is detected if requests are running but TPS is 0 for 6 seconds.
- GPU: Uses nvidia-smi to check utilization, clocks, power, temperature, and throttling.
- Parity: Parses acceptance/parity-matrix.yaml to count green, red, manual, and n/a states.
- Verdicts: Reads the last 5 lines of verdicts.jsonl.
- Site: Checks the age of journey-web/dist/index.html and its sync with site-data.json.
- Codex Review: Checks the age and health score of the latest meta-judge verdict.
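The heartbeat thresholds and the vLLM stall rule from the list above reduce to a few comparisons. The sketch below is illustrative: the function names are hypothetical, and deriving TPS from the difference of a token counter sampled twice is an assumption about how the script measures it.

```python
from __future__ import annotations

HEARTBEAT_WARN_S = 600  # older than this is WARN
HEARTBEAT_CRIT_S = 900  # older than this is CRIT


def classify_heartbeat(age_s: float) -> str:
    """Map a heartbeat age in seconds to OK, WARN, or CRIT."""
    if age_s > HEARTBEAT_CRIT_S:
        return "CRIT"
    if age_s > HEARTBEAT_WARN_S:
        return "WARN"
    return "OK"


def tokens_per_second(tokens_before: float, tokens_after: float, interval_s: float) -> float:
    """Live TPS from two samples of a monotonically increasing token counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Clamp at zero in case the counter reset between the two samples.
    return max(0.0, (tokens_after - tokens_before) / interval_s)


def is_stalled(running_requests: int, tps: float) -> bool:
    """A stall: requests are in flight but no tokens were produced."""
    return running_requests > 0 and tps == 0.0
```

With a 6-second sampling window, is_stalled(2, tokens_per_second(1000, 1000, 6.0)) reports a stall, while an idle server with no running requests does not.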
Liveness Watchdog
The autoswe-watchdog.service is a oneshot service that runs deploy/liveness-watchdog.py 12. It is designed to restart the main autoswe.service if the state.json heartbeat is stale. The autoswe-watchdog.timer fires this service every 5 minutes after a 2-minute boot delay, with an accuracy of 30 seconds 13.
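The watchdog script itself is not reproduced in this document, so the following is only a sketch of the staleness decision under stated assumptions. It uses the file's modification time as the heartbeat, a 900-second threshold borrowed from the monitor's CRIT level, and treats a missing file as stale. None of these choices is confirmed by the source.

```python
from __future__ import annotations

import time
from pathlib import Path

STALE_AFTER_S = 900.0  # assumption: the real threshold is not documented here


def heartbeat_age_s(path: Path, now: float | None = None) -> float | None:
    """Seconds since `path` was last modified, or None if it does not exist."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return None
    current = time.time() if now is None else now
    return max(0.0, current - mtime)


def needs_restart(age_s: float | None, stale_after_s: float = STALE_AFTER_S) -> bool:
    """Decide whether the loop looks dead (missing or stale heartbeat)."""
    return age_s is None or age_s > stale_after_s
```

A real watchdog would follow a positive needs_restart result by restarting autoswe.service. Because the timer fires every 5 minutes, a stall would be noticed at most one timer interval after the heartbeat crosses the threshold.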
[Unit]
Description=autoswe journey HTTP server - serves the published static site (publish mode)
Documentation=https://github.com/sddcinfo/autoswe
[Service]
Type=simple
WorkingDirectory=@AUTOSWE_REPO_ROOT@
Environment="PATH=%h/.local/share/mise/shims:%h/.local/bin:/usr/local/bin:/usr/bin:/bin"
# --public: serve ONLY the leak-scanned static dist/ (+ snapshots); the raw
# debug surfaces (/api/journey, /raw, /status) are 404. The frontend no longer
# fetches /api, so nothing breaks.
ExecStart=/usr/bin/python3 @AUTOSWE_REPO_ROOT@/scripts/journey_server.py --bind 0.0.0.0 --port 8765 --public
Restart=on-failure
RestartSec=10
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-journey-server
[Install]
WantedBy=default.target
[Unit]
Description=autoswe journey synthesizer - regenerates site-data.json + per-iter content and rebuilds the published site (leak-gated)
Documentation=https://github.com/sddcinfo/autoswe
[Service]
Type=oneshot
# PATH includes mise shims so --rebuild-site can invoke npm/node. --public
# (strict, fail-closed redaction) + --no-codex-cleanup (structured data needs no
# AI-speak strip and it keeps the local model out of the regen path).
Environment="PATH=%h/.local/share/mise/shims:%h/.local/bin:/usr/local/bin:/usr/bin:/bin"
Environment="HOME=%h"
Environment="JOURNEY_PUBLISH=1"
ExecStart=/bin/bash @AUTOSWE_REPO_ROOT@/scripts/synthesize-journey.sh synthesize --public --no-codex-cleanup --rebuild-site
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-journey-synth
Nice=10
[Unit]
Description=autoswe journey synthesizer - fires every 15 min to refresh aggregate docs + 8765 site
Documentation=https://github.com/sddcinfo/autoswe
[Timer]
OnBootSec=2min
OnUnitActiveSec=15min
Unit=autoswe-journey-synth.service
AccuracySec=30s
[Install]
WantedBy=timers.target
[Unit]
Description=Watch /data/autoswe/synth-trigger and fire autoswe-synth-bridge.service
Documentation=https://github.com/sddcinfo/autoswe
[Path]
# When the in-pod journey-server's POST /synthesize endpoint writes this
# file (via atomic rename), systemd starts autoswe-synth-bridge.service.
# PathExists fires on creation; PathExistsGlob alone isn't enough since we
# need to retrigger if the same path reappears after the service removed it.
PathExists=/data/autoswe/synth-trigger
Unit=autoswe-synth-bridge.service
MakeDirectory=true
DirectoryMode=0755
[Install]
WantedBy=default.target
[Unit]
Description=autoswe synth bridge - consume /data/autoswe/synth-trigger and run synthesize-journey.sh
Documentation=https://github.com/sddcinfo/autoswe
# Triggered by autoswe-synth-bridge.path on PathExists=. Reads the trigger
# file, removes it, then runs the full host-side synthesis pipeline (pnpm,
# codex, etc.) where credentials live.
After=network.target
[Service]
Type=oneshot
Environment="PATH=%h/.local/share/mise/shims:%h/.local/bin:/usr/local/bin:/usr/bin:/bin"
Environment="HOME=%h"
Environment="JOURNEY_PUBLISH=1"
WorkingDirectory=@AUTOSWE_REPO_ROOT@
# Two steps, atomic-ish: remove the trigger first (so re-triggers during the
# run get coalesced - only one extra run will fire after we finish), then
# execute synth. The trigger file is removed even if synth fails so the
# operator doesn't get stuck retriggering the same broken state.
ExecStartPre=/bin/rm -f /data/autoswe/synth-trigger
ExecStart=/bin/bash @AUTOSWE_REPO_ROOT@/scripts/synthesize-journey.sh synthesize --public --no-codex-cleanup --rebuild-site
TimeoutStartSec=600
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-synth-bridge
Nice=10
[Unit]
Description=autoswe codex meta-judge - external Codex review of loop behavior
Documentation=https://github.com/sddcinfo/autoswe
[Service]
Type=oneshot
ExecStart=/usr/bin/env python3 @AUTOSWE_REPO_ROOT@/deploy/codex_meta_judge.py
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-meta-judge
[Unit]
Description=autoswe codex meta-judge - fires every 30 min to grade the system
Documentation=https://github.com/sddcinfo/autoswe
[Timer]
OnBootSec=5min
OnUnitActiveSec=30min
Unit=autoswe-meta-judge.service
AccuracySec=1min
[Install]
WantedBy=timers.target
#!/usr/bin/env python3
"""Run one Codex rubric prompt non-interactively and print the codex response.
The per-iter test-judge MUST call this instead of invoking `codex exec`
directly. Running codex without a closed stdin makes it block on "Reading
additional input from stdin" until the orchestrator's 10-minute judge wait
expires (the judge_timeout REVERT seen on 2026-05-22). This passes the prompt
as an argument and closes stdin, so codex returns immediately. Flags mirror
the working deploy/codex-meta-judge.sh invocation.
Usage: python3 deploy/codex-rubric.py <prompt-file> # prints codex output
"""
from __future__ import annotations
import pathlib
import subprocess
import sys
TIMEOUT_S = 420 # comfortably under the orchestrator's 600s judge wait
def _extract_codex_section(raw: str) -> str:
    """codex exec prints sections delimited by bare 'codex' / 'tokens used'
    marker lines. Return the LAST 'codex' section; fall back to the whole
    output if the markers are absent."""
    out: list[str] = []
    in_codex = False
    for ln in raw.splitlines():
        s = ln.strip()
        if s == "codex":
            # A new 'codex' marker starts a fresh section and discards the old one.
            in_codex, out = True, []
            continue
        if s == "tokens used":
            if in_codex:
                return "\n".join(out).strip()
            in_codex = False
            continue
        if in_codex:
            out.append(ln)
    return "\n".join(out).strip() if out else raw.strip()
[Unit]
Description=autoswe system-state monitor - time-series health + performance capture (no blind spots)
Documentation=https://github.com/sddcinfo/autoswe
[Service]
Type=oneshot
Environment="PATH=%h/.local/share/mise/shims:%h/.local/bin:/usr/local/bin:/usr/bin:/bin"
Environment="HOME=%h"
# Source AUTOSWE_VLLM_URL (and any other AUTOSWE_* knobs) from the file that
# `sddc k3s deploy --stack autosre` refreshes on every deploy. The leading
# dash makes a missing file non-fatal: the script falls back to the
# default http://127.0.0.1:8010 if the env file is absent, which is the
# right behavior for pre-migration installs.
EnvironmentFile=-%h/.config/autoswe/cluster.env
# --log appends a flat record (tps, GPU power/clocks, contention, iter/vertex,
# parity, CRIT/WARN) to system-state.jsonl every run -> a queryable time series.
ExecStart=/usr/bin/env python3 @AUTOSWE_REPO_ROOT@/deploy/system_state.py --log
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-monitor
Nice=15
#!/usr/bin/env python3
"""autoswe system-state reporter - the single reliable view, no blind spots.
Captures the WHOLE live state of the autoswe loop + its substrate in one shot,
each check defensive (a failure in one never blanks the rest), and flags
anomalies (OK / WARN / CRIT) so problems surface instead of hiding. Built after
a run of blind ad-hoc ssh probes kept missing things (unbounded-wait hang,
0-token vLLM stall, lingering subagents, 2-day-stale site).
Run ON GB10:
python3 deploy/system_state.py # human-readable report
python3 deploy/system_state.py --json # machine-readable
python3 deploy/system_state.py --probe-tps # add a controlled tps benchmark
Exit code: 0 if no CRIT, 1 if any CRIT finding.
"""
from __future__ import annotations
import argparse
import json
import os
import re
import subprocess
import time
import urllib.request
from pathlib import Path
HOME = Path.home()
RUNTIME = HOME / ".autoswe" / "rebuild" / "runtime"
# Repo root with NO monorepo assumption: AUTOSWE_REPO_ROOT wins, else derive from
# this file's location (deploy/system_state.py → parents[1] is the repo root).
# This script runs as a bare `python3 deploy/system_state.py` with no package on
# sys.path, so it cannot import autoswe_cli.paths and must resolve independently.
REPO = Path(os.environ.get("AUTOSWE_REPO_ROOT", str(Path(__file__).resolve().parents[1]))).expanduser()
SDDC = HOME / "sddc"
# vLLM URL is overridable via the AUTOSWE_VLLM_URL env var so the monitor
# follows the autosre vLLM ClusterIP after the k3s migration. The host
# autoswe-monitor.service unit picks the value up from
# ~/.config/autoswe/cluster.env, which `sddc k3s deploy --stack autosre`
# refreshes on every deploy.
[Unit]
Description=autoswe system-state monitor - fires every 3 min to capture health + performance
Documentation=https://github.com/sddcinfo/autoswe
[Timer]
OnBootSec=1min
OnUnitActiveSec=3min
Unit=autoswe-monitor.service
AccuracySec=15s
[Install]
WantedBy=timers.target
[Unit]
Description=autoswe liveness watchdog - restart autoswe.service if state.json heartbeat is stale
Documentation=https://github.com/sddcinfo/autoswe
[Service]
Type=oneshot
ExecStart=@AUTOSWE_REPO_ROOT@/deploy/liveness-watchdog.py
StandardOutput=journal
StandardError=journal
SyslogIdentifier=autoswe-watchdog
[Unit]
Description=autoswe liveness watchdog - fires every 5 min to catch silent stalls
Documentation=https://github.com/sddcinfo/autoswe
[Timer]
OnBootSec=2min
OnUnitActiveSec=5min
Unit=autoswe-watchdog.service
AccuracySec=30s
[Install]
WantedBy=timers.target