Skip to content

Architecture Overview

The autosre system is a local LLM management stack designed to run Claude Code against a vLLM instance hosted within a k3s cluster, effectively replacing the need for an Anthropic API key by speaking the native Anthropic Messages API locally 1. The architecture centers on a Python CLI that orchestrates the lifecycle of Kubernetes pods, including the vLLM inference engine, an Anthropic-to-OpenAI translation proxy, and auxiliary services like a Playwright-driven browser and local Model Context Protocol (MCP) servers for web fetching and search. Data flows from the user through the CLI into the k3s cluster, where the proxy translates requests to the vLLM backend, while auxiliary tools handle external data retrieval and browser automation.

The system operates primarily within a k3s environment on a GB10 node, where multiple services run as pods with pinned ClusterIPs. The CLI serves as the control plane, managing the start, stop, and status of these pods via k3s lifecycle helpers. The core inference path involves the CLI invoking the autosre claude command, which configures Claude Code to connect to a local proxy pod. This proxy translates Anthropic Messages API requests into OpenAI Chat Completions before forwarding them to the vLLM pod. Concurrently, local MCP servers provide tools for web fetching, search, and browser automation, which Claude Code can invoke during sessions.

diagram

The backend abstraction is defined in autosre/backends/, with VllmBackend serving as the primary implementation. This backend acts as a thin client for the k3s-hosted vLLM instance, handling URL resolution, health checks, and model label discovery without managing the pod lifecycle directly. The vLLM pod itself is configured via YAML recipes in autosre/backends/recipes/, which drive both the Helm chart deployment and runtime parity checks. The current production recipe targets the qwen3.6-35b-a3b-nvfp4 model using the vllm/vllm-openai:v0.24.0 image 2.

To handle concurrent workloads, vLLM is configured with priority-based scheduling and chunked prefill. A custom Python hook, vllm_priority_preempt.py, is installed into the vLLM container’s site-packages to enable priority preemption in the V1 scheduler, allowing high-priority requests to evict low-priority ones 1. The proxy pod, running autosre.backends.anthropic_proxy, bridges the gap between the Anthropic API format expected by Claude Code and the OpenAI-compatible API exposed by vLLM.

Beyond inference, autosre provides several auxiliary services running as k3s pods or local processes. The autosre-browser pod runs a browserless/chromium instance accessible via CDP, supporting tools like screenshot capture, PDF generation, and interactive page automation. Recordings from browser sessions are stored in a host-mounted directory, allowing the local MCP server to discover them.

Local MCP servers handle web interactions and command discovery. The fetch server uses curl_cffi for TLS fingerprint impersonation, while the search server integrates with DuckDuckGo. The capabilities server introspects the live Click command group, allowing Claude Code to discover and execute autosre commands directly. These servers are configured dynamically when launching autosre claude, ensuring the local session has access to these tools without relying on external cloud services.

The CLI manages the infrastructure lifecycle through autosre/k3s_lifecycle.py, which scales deployments, waits for rollouts, and gates operations on GPU state. Configuration is XDG-compliant, stored in ~/.local/share/autosre/ and ~/.config/autosre/. Helm charts in helm/autosre/ define the Kubernetes resources, with values driven by the backend recipes. A parity check ensures the live pod configuration matches the recipe definition, emitting warnings if drift is detected 2.

Performance and health are monitored via autosre/watch.py, which provides a Rich-live interface displaying vLLM metrics, pod logs, TCP connections, and host GPU telemetry 1. The system also includes a self-hosted HTTPS file dropbox subsystem for secure file transfers, managed via systemd units and TLS termination.