Inference Backends

The vLLM backend in autosre targets a k3s-only environment in which vLLM is the single inference backend 1. It is a thin client layer: it resolves URLs, checks health, and discovers model labels, but it does not manage the lifecycle of the pods themselves 2. Claude Code reaches the inference engine through an Anthropic Messages API proxy pod, which translates its requests into vLLM’s OpenAI Chat Completions format. As a result, the host machine does not run the inference server directly; clients connect to the pinned ClusterIP addresses of the k3s pods.

The backend system is rooted in an abstract Backend interface defined in autosre/backends/base.py 3. This interface defines standard methods for setup, status reporting, health checks, and environment variable generation for client tools like Claude Code. The BackendType enum restricts valid backends to VLLM, reflecting the k3s-only constraint.
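As a rough illustration of the interface described above, the sketch below shows one way an abstract backend and a single-member enum could be laid out. Only is_healthy and BackendType.VLLM are named in this page; the other method names (setup, get_status, get_client_env) and all signatures are assumptions made for the example, not the contents of autosre/backends/base.py.

```python
from abc import ABC, abstractmethod
from enum import Enum
from typing import Dict


class BackendType(Enum):
    """Single member, mirroring the k3s-only constraint."""
    VLLM = "vllm"


class Backend(ABC):
    """Hypothetical shape of the abstract interface (names are assumed)."""

    @abstractmethod
    def setup(self) -> None:
        """Prepare the backend for use."""

    @abstractmethod
    def get_status(self) -> Dict[str, str]:
        """Return a small status report."""

    @abstractmethod
    def is_healthy(self) -> bool:
        """Return True when the backend can serve requests."""

    @abstractmethod
    def get_client_env(self) -> Dict[str, str]:
        """Return environment variables for a client tool such as Claude Code."""
```

Because every method is abstract, Backend itself cannot be instantiated; a concrete class has to implement all four before it can be constructed.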

The factory function get_backend in autosre/backends/__init__.py enforces this constraint 1. It still accepts a backend_type argument for call-site compatibility, but raises a ValueError for any value other than vllm or None. The detect_platform function likewise always returns BackendType.VLLM 3.
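The factory behaviour can be sketched as follows. This is a minimal stand-in, assuming backend_type is passed as a string and that VllmBackend takes no constructor arguments; neither detail is confirmed here, and the error message is invented for the example.

```python
from enum import Enum
from typing import Optional


class BackendType(Enum):
    VLLM = "vllm"


class VllmBackend:
    """Stand-in for the real client class."""


def get_backend(backend_type: Optional[str] = None) -> VllmBackend:
    # The argument exists only so older call sites keep working.
    if backend_type is not None and backend_type != BackendType.VLLM.value:
        raise ValueError(
            f"unsupported backend {backend_type!r}; only 'vllm' is available"
        )
    return VllmBackend()


def detect_platform() -> BackendType:
    # No probing: the k3s-only setup has exactly one answer.
    return BackendType.VLLM
```

Keeping the parameter while rejecting other values lets callers fail loudly on a stale configuration instead of silently getting a different backend than they asked for.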

The VllmBackend class in autosre/backends/vllm.py acts as the client for the vLLM pod running in k3s 2. It does not start or stop the server; deployment is handled by autosre k3s up and scaling by autosre start / autosre stop.

The backend defines three distinct ports for different client interactions:

  • api_port (8010): The direct vLLM pod URL.
  • proxy_port (8011): The Anthropic API proxy port for Claude Code.
  • codex_proxy_port (8012): The OpenAI Responses-API proxy port for the Codex CLI.

The get_api_url method returns the vLLM pod URL via get_vllm_url().

The is_healthy method implements a two-tier health check strategy. It first probes the proxy pod’s /health endpoint. A 200 status is insufficient on its own because the proxy may report backend: loading|error even if the pod is up. The method requires backend == ok (or an empty/missing field for older proxies) to return True. If the proxy check fails, it falls back to a direct probe of the vLLM pod’s /health endpoint.
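The two-tier check can be sketched as a pure function. To keep the example runnable without a cluster, the HTTP call is injected as a probe callable, which is an assumption of this sketch and not how the real method is structured. The sketch also assumes that a proxy reporting loading or error counts as a failed proxy check and therefore triggers the direct fallback.

```python
from typing import Callable, Optional, Tuple

# A probe returns (HTTP status, parsed JSON body), or None if the request failed.
Probe = Callable[[str], Optional[Tuple[int, dict]]]


def is_healthy(proxy_url: str, vllm_url: str, probe: Probe) -> bool:
    # Tier 1: the proxy must answer 200 AND report a ready backend.
    result = probe(f"{proxy_url}/health")
    if result is not None:
        status, body = result
        backend = body.get("backend")
        # Older proxies omit the field, so missing or empty is accepted.
        if status == 200 and backend in (None, "", "ok"):
            return True
    # Tier 2: fall back to probing the vLLM pod directly.
    result = probe(f"{vllm_url}/health")
    return result is not None and result[0] == 200
```

In real use the probe would wrap an HTTP GET with a short timeout and return None on connection errors, so that an unreachable proxy falls through to the second tier instead of raising.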

The models dictionary maps short keys to full model IDs, such as qwen3.6-nvfp4 to nvidia/Qwen3.6-35B-A3B-NVFP4. The default_model is set to qwen3.6-nvfp4.
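A small sketch of how such a mapping might be consulted is shown below. The dictionary entry and default key come from the text above; the resolve_model_id helper is hypothetical and only illustrates the lookup with a fallback to the default.

```python
from typing import Optional

# Short recipe key -> full model ID (entry taken from the description above).
MODELS = {
    "qwen3.6-nvfp4": "nvidia/Qwen3.6-35B-A3B-NVFP4",
}
DEFAULT_MODEL = "qwen3.6-nvfp4"


def resolve_model_id(key: Optional[str] = None) -> str:
    """Hypothetical helper: map a short key (or the default) to a full ID."""
    key = key or DEFAULT_MODEL
    try:
        return MODELS[key]
    except KeyError:
        known = ", ".join(sorted(MODELS))
        raise ValueError(f"unknown model key {key!r}; known keys: {known}") from None
```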

The get_claude_model_arg method resolves the model label for Claude Code. Claude Code validates the --model argument against its built-in registry, rejecting raw recipe keys like qwen3.6-fp8. The proxy advertises a claude-* ID that passes this validation. Resolution follows this order:

  1. The AUTOSRE_CLAUDE_LABEL environment variable.
  2. The first id from the proxy’s /v1/models endpoint, if it starts with claude-.
  3. A RuntimeError if the proxy is unreachable.
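The resolution order above can be sketched as a standalone function. The fetch_models callable stands in for a GET on the proxy’s /v1/models endpoint and returns None when the proxy is unreachable; that callable, the function name, and the assumed response shape ({"data": [{"id": ...}]}) are illustrative. The text does not say what happens when the first id lacks the claude- prefix, so raising a RuntimeError in that case is an assumption of this sketch.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_claude_label(
    fetch_models: Callable[[], Optional[dict]],
    env: Mapping[str, str] = os.environ,
) -> str:
    # 1. An explicit override in the environment wins.
    label = env.get("AUTOSRE_CLAUDE_LABEL")
    if label:
        return label

    # 2. Otherwise ask the proxy which model ID it advertises.
    payload = fetch_models()
    if payload is None:
        # 3. The proxy could not be reached.
        raise RuntimeError("proxy unreachable; cannot resolve a claude-* label")

    data = payload.get("data") or []
    first_id = data[0].get("id", "") if data else ""
    if first_id.startswith("claude-"):
        return first_id

    # Assumption: treat a non-claude first ID as an error as well.
    raise RuntimeError(f"proxy advertised no claude-* model (got {first_id!r})")
```

Passing the environment as a parameter keeps the function easy to test; callers that omit it get the process environment.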

The VllmBackend class includes static methods for comparing the live vLLM command line against the defined recipe. The assert_running_vllm_matches_recipe method delegates to vllm_serve.assert_running_vllm_matches_recipe to identify mismatches. The warn_on_recipe_drift method logs warnings for any discrepancies without raising exceptions. These methods are used by the drift test and CI pipelines to ensure parity between the Helm chart configuration and the running pod.
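To make the idea of recipe drift concrete, the sketch below compares two vLLM command lines flag by flag and logs any differences. It is not the logic in vllm_serve: the function names, the flag parsing, and the logger name are all hypothetical, and the real comparison may normalize or ignore flags differently.

```python
import logging
from typing import Dict, List

logger = logging.getLogger("autosre.drift")  # hypothetical logger name


def parse_flags(argv: List[str]) -> Dict[str, str]:
    """Collect --flag value, --flag=value, and bare --flag tokens into a dict."""
    flags: Dict[str, str] = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if token.startswith("--"):
            if "=" in token:
                key, value = token.split("=", 1)
                flags[key] = value
            elif i + 1 < len(argv) and not argv[i + 1].startswith("--"):
                flags[token] = argv[i + 1]
                i += 1  # skip the value we just consumed
            else:
                flags[token] = "true"  # boolean switch
        i += 1
    return flags


def find_mismatches(live: List[str], recipe: List[str]) -> List[str]:
    """Return one message per flag whose value differs or is missing."""
    live_flags, recipe_flags = parse_flags(live), parse_flags(recipe)
    problems = []
    for key in sorted(set(live_flags) | set(recipe_flags)):
        if live_flags.get(key) != recipe_flags.get(key):
            problems.append(
                f"{key}: live={live_flags.get(key)!r} recipe={recipe_flags.get(key)!r}"
            )
    return problems


def warn_on_drift(live: List[str], recipe: List[str]) -> List[str]:
    """Log each mismatch as a warning; never raises on drift."""
    problems = find_mismatches(live, recipe)
    for problem in problems:
        logger.warning("vLLM recipe drift: %s", problem)
    return problems
```

Splitting detection from reporting mirrors the split described above: a strict caller such as a test can fail on a non-empty result, while an interactive command can log and continue.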

Configuration for the vLLM deployment is managed via the VllmConfig dataclass in autosre/backends/vllm_config.py 4. This configuration is persisted at ~/.local/share/autosre/vllm.yaml and created by autosre configure vllm --node <ip>.

The VllmConfig dataclass includes:

  • nodes: A list of GB10Node objects.
  • docker_image: Defaults to vllm/vllm-openai:latest.
  • docker_image_fallback: Defaults to vllm/vllm-openai:nightly.
  • hf_cache_dir: Defaults to /data/huggingface.
  • nccl_socket_ifname: Defaults to enp1s0f0np0.

The head_node property identifies the first node with the HEAD role, or the first node if no explicit head is defined. The worker_nodes property returns all non-head nodes. The is_cluster property returns True if more than one node is configured.
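The fields and derived properties described above can be sketched as a dataclass. The defaults are the ones listed in this section; the NodeRole enum, the fields of GB10Node, and the behaviour for an empty node list (head_node returns None) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class NodeRole(Enum):  # hypothetical role enum
    HEAD = "head"
    WORKER = "worker"


@dataclass
class GB10Node:  # fields are assumed for illustration
    ip: str
    role: NodeRole = NodeRole.WORKER


@dataclass
class VllmConfig:
    nodes: List[GB10Node] = field(default_factory=list)
    docker_image: str = "vllm/vllm-openai:latest"
    docker_image_fallback: str = "vllm/vllm-openai:nightly"
    hf_cache_dir: str = "/data/huggingface"
    nccl_socket_ifname: str = "enp1s0f0np0"

    @property
    def head_node(self) -> Optional[GB10Node]:
        # Prefer an explicit HEAD; otherwise the first node acts as head.
        for node in self.nodes:
            if node.role is NodeRole.HEAD:
                return node
        return self.nodes[0] if self.nodes else None

    @property
    def worker_nodes(self) -> List[GB10Node]:
        head = self.head_node
        return [node for node in self.nodes if node is not head]

    @property
    def is_cluster(self) -> bool:
        return len(self.nodes) > 1
```

For example, a configuration with two nodes where only the second is marked HEAD yields that second node as head_node, the first as the sole worker, and is_cluster equal to True.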