
Deployment & Infrastructure

The autoswe deployment relies on a combination of systemd units and Kubernetes configurations to manage the autonomous loop, site generation, and system health monitoring. The core architecture separates concerns into distinct services: a static HTTP server for the published site, a synthesis pipeline for regenerating content, a meta-judge for external review, a system-state monitor for health metrics, and a watchdog for liveness checks. These components are orchestrated via systemd timers and path triggers to ensure continuous operation and rapid response to state changes.

The autoswe-journey-server.service runs a simple Python HTTP server that serves the published static site in “publish mode” 1. It binds to 0.0.0.0 on port 8765 and serves only the leak-scanned static dist/ directory and snapshots. Raw debug surfaces such as /api/journey, /raw, and /status return 404; nothing breaks as a result, because the frontend no longer fetches from /api. The service restarts on failure after a 10-second delay and logs to the journal.
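The publish-mode behavior can be illustrated with a minimal sketch built on Python's standard http.server module. This is not the project's actual server code: the names is_blocked, PublishHandler, and make_server are hypothetical, and the blocked prefixes are inferred from the debug surfaces listed above.

```python
import functools
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Debug surfaces that must not be reachable in publish mode.
# The prefixes come from the description above; the real list may differ.
BLOCKED_PREFIXES = ("/api", "/raw", "/status")


def is_blocked(path: str) -> bool:
    """Return True if the request path targets a blocked debug surface."""
    clean = path.split("?", 1)[0].split("#", 1)[0]
    return any(clean == p or clean.startswith(p + "/") for p in BLOCKED_PREFIXES)


class PublishHandler(SimpleHTTPRequestHandler):
    """Serve static files only; blocked paths always answer 404."""

    def do_GET(self):
        if is_blocked(self.path):
            self.send_error(404, "Not available in publish mode")
            return
        super().do_GET()

    def do_HEAD(self):
        if is_blocked(self.path):
            self.send_error(404, "Not available in publish mode")
            return
        super().do_HEAD()


def make_server(directory: str, host: str = "0.0.0.0", port: int = 8765):
    """Build a server rooted at the static directory (assumed to be dist/)."""
    handler = functools.partial(PublishHandler, directory=directory)
    return ThreadingHTTPServer((host, port), handler)

# Usage (blocks forever): make_server("dist").serve_forever()
```

Matching on a whole path segment keeps a static page such as /apidocs/index.html reachable while /api/journey is refused.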

Site regeneration is handled by the autoswe-journey-synth.service, which is a oneshot service that regenerates site-data.json and per-iteration content 2. It executes scripts/synthesize-journey.sh with flags --public, --no-codex-cleanup, and --rebuild-site. The --public flag enforces strict, fail-closed redaction, while --no-codex-cleanup ensures structured data is preserved and keeps the local model out of the regeneration path. The service runs with a nice value of 10 and sets JOURNEY_PUBLISH=1.

This synthesis is triggered periodically by the autoswe-journey-synth.timer, which fires every 15 minutes after an initial 2-minute delay on boot 3. The timer has an accuracy of 30 seconds.

Additionally, a file-system trigger mechanism exists via autoswe-synth-bridge.path and autoswe-synth-bridge.service 4 5. The path unit watches for the creation of /data/autoswe/synth-trigger 4. When this file appears, the bridge service is triggered. The bridge service first removes the trigger file to coalesce re-triggers during execution, then runs the same synthesis pipeline as the timer 5. It has a timeout of 600 seconds and runs with a nice value of 10.
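The ordering that makes coalescing work (delete the trigger first, then synthesize) can be sketched as follows. The real bridge is a systemd unit, so this Python function is only an illustration of the sequence; run_bridge and its injectable run parameter are hypothetical, and the command line is assembled from the flags described earlier.

```python
import subprocess
from pathlib import Path

TRIGGER = Path("/data/autoswe/synth-trigger")
# Same pipeline the timer runs, per the description above.
SYNTH_CMD = [
    "scripts/synthesize-journey.sh",
    "--public",
    "--no-codex-cleanup",
    "--rebuild-site",
]


def run_bridge(trigger: Path = TRIGGER, run=subprocess.run, timeout: int = 600) -> int:
    """Remove the trigger file, then run synthesis and return its exit code."""
    # Deleting first means a request that arrives while synthesis is running
    # re-creates the file once, so any number of requests made during the run
    # collapse into a single follow-up run.
    trigger.unlink(missing_ok=True)
    result = run(SYNTH_CMD, timeout=timeout, check=False)
    return result.returncode
```

If the file were removed after synthesis instead, a request made during the run would be deleted along with the original trigger and silently lost.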

The autoswe-meta-judge.service runs an external Codex review of the loop behavior 6. It executes deploy/codex_meta_judge.py as a oneshot service. The autoswe-meta-judge.timer fires this service every 30 minutes after a 5-minute boot delay, with an accuracy of 1 minute 7.

The meta-judge uses deploy/codex-rubric.py to invoke the Codex CLI non-interactively 8. This wrapper passes the prompt as an argument and closes stdin to prevent blocking on “Reading additional input from stdin”. It uses the gpt-5.5 model with high reasoning effort and has a timeout of 420 seconds. The script extracts the last “codex” section from the output, falling back to the whole output if markers are absent.
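The two techniques described here, running a CLI with stdin closed and pulling out the last “codex” section, can be sketched with the standard library. This is an assumption-laden illustration, not the contents of deploy/codex-rubric.py: the function names are hypothetical, the caller supplies the full argument list, and the marker is assumed to be a line containing only the word codex.

```python
import subprocess


def run_noninteractive(argv, timeout=420):
    """Run a command with stdin attached to /dev/null and return its stdout.

    With no stdin to wait on, a tool that would otherwise pause to read
    additional input sees end-of-file immediately.
    """
    proc = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return proc.stdout


def extract_last_codex_section(output: str, marker: str = "codex") -> str:
    """Return the text after the last marker line, or all output if none exists."""
    lines = output.splitlines()
    last = None
    for i, line in enumerate(lines):
        if line.strip() == marker:  # assumed marker format
            last = i
    if last is None:
        return output.strip()  # fallback: markers absent
    return "\n".join(lines[last + 1:]).strip()
```

Taking the last marker rather than the first skips any intermediate sections and keeps only the final answer.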

The autoswe-monitor.service captures time-series health and performance data 9. It executes deploy/system_state.py with the --log flag, which appends a flat record to system-state.jsonl 10. The monitor reads environment variables from ~/.config/autoswe/cluster.env if present, falling back to http://127.0.0.1:8010 for the vLLM URL 9 10. The service runs with a nice value of 15 9.
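A minimal sketch of the configuration fallback and the append-only log might look like this. It assumes cluster.env holds simple KEY=VALUE lines and that the vLLM address is stored under a key named VLLM_URL; both are illustrative guesses, as are the function names.

```python
import json
from pathlib import Path

DEFAULT_VLLM_URL = "http://127.0.0.1:8010"


def load_env_file(path: Path) -> dict:
    """Parse KEY=VALUE lines; a missing file yields an empty dict."""
    env = {}
    if not path.is_file():
        return env
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def vllm_url(env: dict, key: str = "VLLM_URL") -> str:
    """Use the configured URL if present, else the local default."""
    return env.get(key) or DEFAULT_VLLM_URL  # key name is hypothetical


def append_record(log_path: Path, record: dict) -> None:
    """Append one flat JSON object per line (JSON Lines)."""
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
```

Writing one JSON object per line keeps each sample independent, so a reader can tail the file or load it incrementally without parsing the whole history.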

The autoswe-monitor.timer fires the monitor every 3 minutes after a 1-minute boot delay, with an accuracy of 15 seconds 11.

The system_state.py script performs several checks:

  • Loop: Checks state.json and state-heartbeat.json for iteration status and heartbeat age 10. A heartbeat older than 600 seconds is WARN; one older than 900 seconds is CRIT.
  • Subagents: Checks for lingering subagents exceeding their phase budgets (planner ~10m, writer ~15m, judge ~12m).
  • vLLM: Queries /metrics for running/waiting requests and calculates live tokens per second (TPS). A stall is detected if requests are running but TPS is 0 for 6 seconds.
  • GPU: Uses nvidia-smi to check utilization, clocks, power, temperature, and throttling.
  • Parity: Parses acceptance/parity-matrix.yaml to count green, red, manual, and n/a states.
  • Verdicts: Reads the last 5 lines of verdicts.jsonl.
  • Site: Checks the age of journey-web/dist/index.html and whether it is in sync with site-data.json.
  • Codex Review: Checks the age and health score of the latest meta-judge verdict.
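The heartbeat thresholds and the vLLM stall rule from the list above can be expressed as small pure functions. This is a sketch under stated assumptions rather than the logic in system_state.py: the function names and the "OK" label are hypothetical, and TPS is assumed to be derived from two readings of a cumulative token counter taken a known interval apart.

```python
def classify_heartbeat(age_s: float, warn_s: float = 600, crit_s: float = 900) -> str:
    """Map heartbeat age in seconds to a severity label."""
    if age_s > crit_s:
        return "CRIT"
    if age_s > warn_s:
        return "WARN"
    return "OK"  # label for the healthy case is an assumption


def tokens_per_second(tokens_before: float, tokens_after: float, interval_s: float) -> float:
    """Live TPS from two samples of a cumulative generated-token counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # A counter reset (for example after a server restart) is clamped to zero.
    return max(tokens_after - tokens_before, 0) / interval_s


def is_stalled(running: int, tps: float, window_s: float, stall_window_s: float = 6.0) -> bool:
    """Stall: requests are running, yet no tokens were produced over the window."""
    return running > 0 and tps == 0 and window_s >= stall_window_s
```

For example, a counter that moves from 1000 to 1300 over 6 seconds gives 50 tokens per second, while two running requests with an unchanged counter over the same window count as a stall.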

The autoswe-watchdog.service is a oneshot service that runs deploy/liveness-watchdog.py 12. It is designed to restart the main autoswe.service if the state.json heartbeat is stale. The autoswe-watchdog.timer fires this service every 5 minutes after a 2-minute boot delay, with an accuracy of 30 seconds 13.
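The watchdog's check-then-restart pattern can be sketched as below. The source does not say how staleness is measured or what threshold applies, so this illustration assumes the heartbeat is the modification time of state.json and leaves the maximum age as a caller-supplied parameter; the function names are hypothetical.

```python
import subprocess
import time
from pathlib import Path


def heartbeat_age(state_file: Path, now=None) -> float:
    """Seconds since the state file was last modified (assumed heartbeat signal)."""
    now = time.time() if now is None else now
    try:
        return now - state_file.stat().st_mtime
    except FileNotFoundError:
        return float("inf")  # a missing file is treated as infinitely stale


def check_and_restart(state_file: Path, max_age_s: float,
                      restart=subprocess.run, unit: str = "autoswe.service") -> bool:
    """Restart the unit if the heartbeat is stale; return True if a restart was issued."""
    if heartbeat_age(state_file) <= max_age_s:
        return False
    restart(["systemctl", "restart", unit], check=False)
    return True
```

Because the service is a oneshot fired by a timer, each invocation makes a single decision and exits, and the timer supplies the polling loop.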
