Architecture Overview
The autosre system is a local LLM management stack designed to run Claude Code against a vLLM instance hosted within a k3s cluster, effectively replacing the need for an Anthropic API key by speaking the native Anthropic Messages API locally 1. The architecture centers on a Python CLI that orchestrates the lifecycle of Kubernetes pods, including the vLLM inference engine, an Anthropic-to-OpenAI translation proxy, and auxiliary services like a Playwright-driven browser and local Model Context Protocol (MCP) servers for web fetching and search. Data flows from the user through the CLI into the k3s cluster, where the proxy translates requests to the vLLM backend, while auxiliary tools handle external data retrieval and browser automation.
High-Level Component Flow
Section titled “High-Level Component Flow”The system operates primarily within a k3s environment on a GB10 node, where multiple services run as pods with pinned ClusterIPs. The CLI serves as the control plane, managing the start, stop, and status of these pods via k3s lifecycle helpers. The core inference path involves the CLI invoking the autosre claude command, which configures Claude Code to connect to a local proxy pod. This proxy translates Anthropic Messages API requests into OpenAI Chat Completions before forwarding them to the vLLM pod. Concurrently, local MCP servers provide tools for web fetching, search, and browser automation, which Claude Code can invoke during sessions.
Backend and Inference Architecture
Section titled “Backend and Inference Architecture”The backend abstraction is defined in autosre/backends/, with VllmBackend serving as the primary implementation. This backend acts as a thin client for the k3s-hosted vLLM instance, handling URL resolution, health checks, and model label discovery without managing the pod lifecycle directly. The vLLM pod itself is configured via YAML recipes in autosre/backends/recipes/, which drive both the Helm chart deployment and runtime parity checks. The current production recipe targets the qwen3.6-35b-a3b-nvfp4 model using the vllm/vllm-openai:v0.24.0 image 2.
To handle concurrent workloads, vLLM is configured with priority-based scheduling and chunked prefill. A custom Python hook, vllm_priority_preempt.py, is installed into the vLLM container’s site-packages to enable priority preemption in the V1 scheduler, allowing high-priority requests to evict low-priority ones 1. The proxy pod, running autosre.backends.anthropic_proxy, bridges the gap between the Anthropic API format expected by Claude Code and the OpenAI-compatible API exposed by vLLM.
Auxiliary Services and MCP Servers
Section titled “Auxiliary Services and MCP Servers”Beyond inference, autosre provides several auxiliary services running as k3s pods or local processes. The autosre-browser pod runs a browserless/chromium instance accessible via CDP, supporting tools like screenshot capture, PDF generation, and interactive page automation. Recordings from browser sessions are stored in a host-mounted directory, allowing the local MCP server to discover them.
Local MCP servers handle web interactions and command discovery. The fetch server uses curl_cffi for TLS fingerprint impersonation, while the search server integrates with DuckDuckGo. The capabilities server introspects the live Click command group, allowing Claude Code to discover and execute autosre commands directly. These servers are configured dynamically when launching autosre claude, ensuring the local session has access to these tools without relying on external cloud services.
Lifecycle and Configuration Management
Section titled “Lifecycle and Configuration Management”The CLI manages the infrastructure lifecycle through autosre/k3s_lifecycle.py, which scales deployments, waits for rollouts, and gates operations on GPU state. Configuration is XDG-compliant, stored in ~/.local/share/autosre/ and ~/.config/autosre/. Helm charts in helm/autosre/ define the Kubernetes resources, with values driven by the backend recipes. A parity check ensures the live pod configuration matches the recipe definition, emitting warnings if drift is detected 2.
Performance and health are monitored via autosre/watch.py, which provides a Rich-live interface displaying vLLM metrics, pod logs, TCP connections, and host GPU telemetry 1. The system also includes a self-hosted HTTPS file dropbox subsystem for secure file transfers, managed via systemd units and TLS termination.
# autosre
## Project Overview
Local LLM server management for Claude Code. Python CLI (Click), httpx for HTTP, PyYAML for config.
No Anthropic API key required: the local stack speaks the native Anthropic Messages API.
## Running Locally
k3s-only: this project runs Claude Code against vLLM in k3s on a GB10 (the proxy, codex-proxy,
browser, and vLLM all run as k3s pods at pinned ClusterIPs 10.96.0.30-.34).
Built-in WebFetch/WebSearch are replaced by local MCP servers (`autosre-mcp-fetch`, `autosre-mcp-search`).
## Development
- Python 3.14+, Hatchling build system
- Install: `pip install -e '.[dev]'`
- Lint: `ruff check autosre/ tests/`
- Format: `ruff format autosre/ tests/`
- Type check: `mypy autosre/`
- Test: `pytest`
## Code Conventions
- 100-char line length, double-quoted strings
- Strict mypy, comprehensive ruff rule set
- Lazy imports in CLI commands (from X import Y inside function body)
- Click for CLI, httpx for HTTP, PyYAML for config
- curl_cffi for web fetching (browser TLS fingerprint impersonation)
- XDG-compliant data storage: ~/.local/share/autosre/
- Type annotations on all functions
## Key Architecture
- `autosre/backends/` - Backend ABC + the single `VllmBackend` (a thin k3s client: URL + health + model-label discovery; it does not start/stop anything)
- `autosre/backends/recipes/` - YAML model configs; the runtime source of truth is `helm/autosre/values.yaml` (recipes drive the chart + the parity check)
- `autosre/backends/anthropic_proxy.py` - Anthropic Messages API → OpenAI Chat Completions translator. Runs INSIDE the `autosre-proxy` pod (`python -m autosre.backends.anthropic_proxy 8011 http://autosre-vllm-local:8010`); `codex_proxy.py` is the same for the codex-proxy pod
- `autosre/backends/vllm_priority_preempt.py` - `.pth`-installed sitecustomize hook (mounted into the vLLM container's Python 3.12 site-packages, so it must stay 3.12-syntax-compatible) that adds priority preemption to vLLM's V1 scheduler (evicts low-priority running requests to admit high-priority waiting ones)
- `autosre/mcp_servers/` - Local MCP servers:
- `fetch` / `search` - web fetch (curl_cffi) + web search (DuckDuckGo)
- `capabilities` - command-discovery server that introspects the live autosre click group so Claude can search autosre commands with `list_modules`, `search_commands`, `get_command`
- `browser` - Playwright-driven browser tools backed by the `autosre-browser` pod; complements `fetch`/`search` for JS-rendered, login-walled, or interactive pages (10 tools: `browser_render_markdown`, `browser_screenshot`, `browser_pdf`, `browser_search`, `browser_session_open`, `browser_snapshot`, `browser_act`, `browser_session_list_tabs`, `browser_session_close`, `browser_prune_recordings`)
- `autosre/services/` - Auxiliary non-LLM services in the autosre lifecycle:
- `browserless.py` - `BrowserlessService` is a thin client for the `autosre-browser` k3s pod (browserless/chromium); no docker lifecycle. The CDP URL derives from the pinned ClusterIP (`config.get_browser_url()` = http://10.96.0.33:3010); an optional token comes from `AUTOSRE_BROWSER_TOKEN` (the internal pod runs token-less by default). Recording is **opt-in per `browser_session_open(record=True)`**: the pod writes `.webm` to `RECORDING_PATH=/recordings`, which the chart mounts (hostPath via `browser.recordingHostPath`, else emptyDir) so the host MCP can discover them; pruned after 24h (override via `AUTOSRE_BROWSER_RECORDING_TTL`). **Do not** pass `record=True` for sessions that fill credentials. browserless v2 is AGPL - POC use only, not for redistribution. `browser_render_markdown` defaults to `wait_until="domcontentloaded"`; if a page is missing content, retry with `wait_for_selector="<thing you want>"` rather than blindly switching to `networkidle` (which hangs on sites with persistent SSE/analytics).
- `autosre/cli.py` - All Click commands (command groups: `setup`, `start`, `stop`, `status`, `claude`, `codex`, `k3s`, `mcp`, `swarm`, `dropbox`, `cluster`, `provision`, `models`, `keys`, `ssh`, `configure`, `demo`, `eval`, `dedicated`, `ui`, `metrics`, `watch`, `bench`, `backends`)
| No `--enforce-eager` | (absent) | CUDA graphs enabled. Stock vllm/vllm-openai supports SM121 graph capture on the 2026-04+ images. Gives +80% translation TPS, +46% coding TPS vs eager mode. First-request TTFT pays ~1s graph-capture cost (absorbed by the harness warmup phase). |
### Container env vars (codified in `vllm_serve.build_runtime_env`)
`vllm_serve.build_runtime_env(recipe)` is the single source of truth for the env vars the chart passes to the vLLM container. The helm deployment renders from the recipe, and the recipe-parity check diffs the live pod against it.
| Env var | Value | Why |
|---|---|---|
| `HF_HUB_OFFLINE` | `1` | Block any network I/O at boot - otherwise the container races systemd-resolved at cold boot and crash-loops on `Temporary failure in name resolution`. |
| `TRANSFORMERS_OFFLINE` | `1` | Companion to `HF_HUB_OFFLINE` for the transformers library. |
| `HF_HUB_CACHE` | `/data/huggingface/hub` | Direct the HF library at the canonical bind-mount where `meeting-scribe gb10 pull-models` places weights. Without this the library defaults to `/root/.cache/huggingface/hub` and a fresh container crash-loops with `LocalEntryNotFoundError` if the user-cache mount doesn't have the model yet. |
| `NVIDIA_DISABLE_REQUIRE` | `1` | Disable the upstream image's strict `NVIDIA_REQUIRE_CUDA` envelope; some legitimate driver versions on GB10 fall outside the baked-in range. |
| `CUBLAS_WORKSPACE_CONFIG` | `:4096:8` | Pre-sizes 8 × 4 MiB cuBLAS workspaces - required to prevent the `CUBLAS_STATUS_INTERNAL_ERROR` cascade observed 2026-04-18 23:45 under burst concurrent prefill. (Set via the recipe's `env:` block.) |
| `VLLM_MARLIN_USE_ATOMIC_ADD` | `1` | NVIDIA DGX Spark thread 366822 recommendation; companion to the cuBLAS workspace config. (Set via the recipe's `env:` block.) |
| `VLLM_ALLOW_LONG_MAX_MODEL_LEN` | `1` | Allow `max_model_len=262144` despite vLLM's default safety cap. (Set via the recipe's `env:` block.) |
| `HF_TOKEN` | `$HF_TOKEN` from operator environ | Forwarded only when present in the operator's environment; not strictly required at runtime since the container is fully offline, but kept available for the `pull-models` path that runs at install time. |
| `VLLM_USE_DEEP_GEMM` | `0` | vLLM 0.24.0 selects the DeepGEMM FP8-MoE backend for this model family on Blackwell; its E8M0 scale-factor layout crashes on sm_121 ("Unknown SF transformation"). Forcing it off routes MoE to the working backends (MARLIN for the NVFP4 experts). (Set via the recipe's `env:` block.) |
### Boot-time recipe-parity sentinel
`VllmBackend.warn_on_recipe_drift(recipe, api_port)` diffs the live `vllm serve ...` cmdline against what `vllm_serve.build_vllm_serve_cmd(recipe)` would have produced and emits a WARNING per mismatch. Catches drift classes the recipe-edit guard alone can't see: helm values hand-edited away from the recipe, stale pod state, and missing `extra_args`. Deliberately non-fatal; the CI test in `tests/test_vllm.py::TestRecipeParityCheck` is the strict gate.
### Deferred tuning items
- **Speculative decoding** - MTP (`qwen3_next_mtp`) loads eager-only on 0.24.0 (cudagraph capture host-OOMs on GB10) and is superseded there by DFlash, which is blocked on a production-safe 35B-A3B drafter (the only public one is an uncensored abliteration) and requires BF16 KV. Not shippable without a drafter plus a correctness pass.
- **Native FP4 MoE backends** (`--moe-backend flashinfer|cutlass`) - sm_121 lacks the tcgen05 tensor cores the native path needs; measured ~16% slower than MARLIN aggregate (vllm#43906). Revisit when a vLLM/FlashInfer release adds sm_121 tensor-core FP4.
- **`--quantization=auto_round`** - dispatches to gptq_marlin on Blackwell. Research-grade, not a flag flip.
## Hooks - owned by the marketplace plugins
All Claude Code hook behavior (Bash guard, post-Bash audit, session-end checklist, branch warning, pre-compact context injection, subagent context injection, plan-review chain) lives in the `bradlay/claude-code-recipes` marketplace plugin set. autosre no longer ships any Claude Code hooks of its own.
### Bare `claude` (online mode)
Once the marketplace is added (`/plugin marketplace add bradlay/claude-code-recipes`) and the plugins are enabled in `~/.claude/settings.json`, every hook fires automatically. autosre has no install/uninstall step here.
The bash-guard plugin reads custom rules from `${XDG_CONFIG_HOME}/claude-bash-guard/rules.yaml` (or `$CLAUDE_BASH_GUARD_RULES_FILE`). The sddcinfo + autosre opinionated rules (python-c uplift, venv backdoors, recipe-yaml protection, sddc-CLI routing, etc.) live there.
### `autosre claude` (offline / local mode)