Inference Backends
The vLLM backend implementation in autosre is designed exclusively for a k3s-only environment, where vLLM serves as the single inference backend 1. The architecture relies on a thin client layer that resolves URLs, checks health, and discovers model labels without managing the lifecycle of the pods themselves 2. Communication with the inference engine is mediated through an Anthropic Messages API proxy pod, which translates requests from Claude Code into vLLM’s OpenAI Chat Completions format. This separation ensures that the host machine does not need to run the inference server directly, relying instead on the pinned ClusterIP addresses of the k3s pods.
Backend Factory and Abstract Interface
The backend system is rooted in an abstract Backend interface defined in autosre/backends/base.py 3. This interface defines standard methods for setup, status reporting, health checks, and environment variable generation for client tools like Claude Code. The BackendType enum restricts valid backends to VLLM, reflecting the k3s-only constraint.
The factory function get_backend in autosre/backends/__init__.py enforces this constraint 1. It accepts a backend_type argument for call-site compatibility but raises a ValueError if the type is not vllm or None. The detect_platform function also hardcodes the return of BackendType.VLLM 3.
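The following is a minimal, self-contained sketch of how the factory treats its argument. It mirrors the logic described above but uses a stand-in class (`VllmBackendStub`, hypothetical) instead of the real `VllmBackend`, so it runs without the autosre package installed.

```python
from enum import Enum


class BackendType(Enum):
    VLLM = "vllm"


class VllmBackendStub:
    """Stand-in for VllmBackend (hypothetical, for illustration only)."""

    name = "vllm"


def get_backend(backend_type=None):
    # Strings are coerced through the enum first, so an unknown name such as
    # "ollama" already fails here with the enum's own ValueError.
    if isinstance(backend_type, str):
        backend_type = BackendType(backend_type)
    # With a single enum member this check cannot fail for enum values today;
    # it guards against a second member being added later.
    if backend_type not in (None, BackendType.VLLM):
        raise ValueError(f"unsupported backend {backend_type!r}: the stack is k3s vLLM-only")
    return VllmBackendStub()


# All three accepted spellings resolve to the same backend.
for arg in (None, "vllm", BackendType.VLLM):
    assert get_backend(arg).name == "vllm"

# Anything else is rejected with ValueError.
try:
    get_backend("ollama")
except ValueError as exc:
    print(f"rejected: {exc}")
```

Because the string is converted with `BackendType(...)` before the explicit check, callers should expect a `ValueError` in either case, although the message for an unknown string comes from the enum rather than from the factory.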
vLLM Client Implementation
The VllmBackend class in autosre/backends/vllm.py acts as the client for the vLLM pod running in k3s 2. It does not start or stop the server; deployment is handled by autosre k3s up and scaling by autosre start / autosre stop.
URL Resolution and Ports
The backend defines three distinct ports for different client interactions:
- api_port (8010): The port of the direct vLLM pod URL.
- proxy_port (8011): The Anthropic API proxy port for Claude Code.
- codex_proxy_port (8012): The OpenAI Responses-API proxy port for the Codex CLI.
The get_api_url method returns the vLLM pod URL via get_vllm_url().
Health Checks
The is_healthy method implements a two-tier health check strategy. It first probes the proxy pod’s /health endpoint. A 200 status is insufficient on its own because the proxy may report backend: loading|error even if the pod is up. The method requires backend == ok (or an empty/missing field for older proxies) to return True. If the proxy check fails, it falls back to a direct probe of the vLLM pod’s /health endpoint.
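The two-tier check can be sketched as below. This is not the actual `is_healthy` implementation: the HTTP call is replaced by an injected `probe` callable (hypothetical) so the logic runs without a cluster, and the URLs are placeholders. The sketch also assumes that a reachable proxy reporting `loading` or `error` counts as a failed proxy check and falls through to the direct probe; the real method may instead return False at that point.

```python
from typing import Callable, Optional, Tuple

# A probe returns (status_code, parsed JSON body or None),
# or None when the request itself failed.
Probe = Callable[[str], Optional[Tuple[int, Optional[dict]]]]


def is_healthy(proxy_url: str, vllm_url: str, probe: Probe) -> bool:
    # Tier 1: the proxy must answer 200 AND report a usable backend.
    result = probe(f"{proxy_url}/health")
    if result is not None:
        status, body = result
        if status == 200:
            backend = (body or {}).get("backend")
            # Older proxies omit the field (or send it empty): treat as ok.
            if backend in (None, "", "ok"):
                return True
    # Tier 2: fall back to probing the vLLM pod directly.
    result = probe(f"{vllm_url}/health")
    return result is not None and result[0] == 200


def fake_probe(responses: dict) -> Probe:
    """Build a probe that answers from a dict; missing URLs are unreachable."""
    return lambda url: responses.get(url)


PROXY = "http://proxy.example:8011"  # placeholder address
VLLM = "http://vllm.example:8010"    # placeholder address

still_loading = fake_probe({f"{PROXY}/health": (200, {"backend": "loading"})})
print(is_healthy(PROXY, VLLM, still_loading))
```

Injecting the probe keeps the decision logic separate from the transport, which makes each branch straightforward to test.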
Model Discovery and Labels
The models dictionary maps short keys to full model IDs, such as qwen3.6-nvfp4 to nvidia/Qwen3.6-35B-A3B-NVFP4. The default_model is set to qwen3.6-nvfp4.
The get_claude_model_arg method resolves the model label for Claude Code. Claude Code validates the --model argument against its built-in registry, rejecting raw recipe keys like qwen3.6-fp8. The proxy advertises a claude-* ID that passes this validation. Resolution follows this order:
- The AUTOSRE_CLAUDE_LABEL environment variable.
- The first id from the proxy’s /v1/models endpoint, if it starts with claude-.
- A RuntimeError if the proxy is unreachable.
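The resolution order above can be sketched as follows. This is an illustration, not the body of `get_claude_model_arg`: the function name `resolve_claude_label` and the injected `fetch_models` callable are hypothetical, and the payload shape (`{"data": [{"id": ...}]}`) assumes the proxy follows the OpenAI-style model list format. The source does not say what happens when the first id lacks the `claude-` prefix; the sketch raises a RuntimeError in that case as well.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_claude_label(
    fetch_models: Callable[[], Optional[dict]],
    env: Mapping[str, str] = os.environ,
) -> str:
    # 1. An explicit override wins.
    label = env.get("AUTOSRE_CLAUDE_LABEL")
    if label:
        return label

    # 2. Otherwise ask the proxy. fetch_models returns the parsed /v1/models
    #    payload, or None when the proxy is unreachable.
    payload = fetch_models()
    if payload is None:
        # 3. No proxy, no label.
        raise RuntimeError("proxy unreachable: cannot discover a claude-* model label")

    data = payload.get("data") or []
    first_id = data[0].get("id", "") if data else ""
    if first_id.startswith("claude-"):
        return first_id

    # Assumption: the source does not specify this branch.
    raise RuntimeError(f"proxy advertised no claude-* label (first id: {first_id!r})")


# The label below is a made-up example, not a real model ID.
payload = {"data": [{"id": "claude-local-example"}]}
print(resolve_claude_label(lambda: payload, env={}))
```

Passing `env` explicitly keeps the example independent of the caller's real environment.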
Recipe Management and Drift Detection
The VllmBackend class includes static methods for comparing the live vLLM command line against the defined recipe. The assert_running_vllm_matches_recipe method delegates to vllm_serve.assert_running_vllm_matches_recipe to identify mismatches. The warn_on_recipe_drift method logs warnings for any discrepancies without raising exceptions. These methods are used by the drift test and CI pipelines to ensure parity between the Helm chart configuration and the running pod.
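The internals of `vllm_serve` are not shown here, so the following is only a rough sketch of what a strict check and a warn-only check over the same comparison might look like. All function names (`parse_flags`, `find_drift`, `assert_matches`, `warn_on_drift`) are hypothetical, and the flag values are illustrative rather than the project's actual recipe.

```python
import logging
import shlex

logger = logging.getLogger("drift-sketch")


def parse_flags(cmdline: str) -> dict[str, str]:
    """Collect ``--flag value`` / ``--flag=value`` pairs; bare flags map to "true"."""
    tokens = shlex.split(cmdline)
    flags: dict[str, str] = {}
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("--"):
            if "=" in tok:
                key, value = tok.split("=", 1)
                flags[key] = value
            elif i + 1 < len(tokens) and not tokens[i + 1].startswith("--"):
                flags[tok] = tokens[i + 1]
                i += 1  # skip the value we just consumed
            else:
                flags[tok] = "true"
        i += 1
    return flags


def find_drift(live_cmdline: str, recipe: dict[str, str]) -> list[str]:
    """Describe every recipe flag whose live value differs or is missing."""
    live = parse_flags(live_cmdline)
    return [
        f"{flag}: recipe={want!r} live={live.get(flag)!r}"
        for flag, want in sorted(recipe.items())
        if live.get(flag) != want
    ]


def assert_matches(live_cmdline: str, recipe: dict[str, str]) -> None:
    """Strict variant: fail loudly, as a test or CI step would."""
    drift = find_drift(live_cmdline, recipe)
    if drift:
        raise AssertionError("; ".join(drift))


def warn_on_drift(live_cmdline: str, recipe: dict[str, str]) -> list[str]:
    """Lenient variant: log each mismatch and never raise."""
    drift = find_drift(live_cmdline, recipe)
    for line in drift:
        logger.warning("recipe drift: %s", line)
    return drift


live = "vllm serve some/model --max-model-len 32768 --gpu-memory-utilization=0.85"
print(warn_on_drift(live, {"--max-model-len": "65536"}))
```

Sharing one comparison function between the strict and lenient entry points keeps the two from disagreeing about what counts as drift.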
Configuration and Node Management
Configuration for the vLLM deployment is managed via the VllmConfig dataclass in autosre/backends/vllm_config.py 4. This configuration is persisted at ~/.local/share/autosre/vllm.yaml and created by autosre configure vllm --node <ip>.
The VllmConfig dataclass includes:
- nodes: A list of GB10Node objects.
- docker_image: Defaults to vllm/vllm-openai:latest.
- docker_image_fallback: Defaults to vllm/vllm-openai:nightly.
- hf_cache_dir: Defaults to /data/huggingface.
- nccl_socket_ifname: Defaults to enp1s0f0np0.
The head_node property identifies the first node with the HEAD role, or the first node if no explicit head is defined. The worker_nodes property returns all non-head nodes. The is_cluster property returns True if more than one node is configured.
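The three properties described above can be sketched with stand-in types. The real `GB10Node` and `NodeRole` live in `autosre.infra.types` and are not shown in this section, so the field names (`ip`, `role`), the `WORKER` member, and the class name `ClusterSketch` are assumptions made for illustration. The addresses come from the reserved documentation range.

```python
from dataclasses import dataclass
from enum import Enum


class NodeRole(Enum):
    HEAD = "head"
    WORKER = "worker"  # assumed member name


@dataclass
class GB10Node:
    ip: str                            # assumed field
    role: NodeRole = NodeRole.WORKER   # assumed field and default


@dataclass
class ClusterSketch:
    nodes: list[GB10Node]

    @property
    def head_node(self) -> GB10Node:
        # Prefer an explicit HEAD; otherwise the first node acts as head.
        for node in self.nodes:
            if node.role is NodeRole.HEAD:
                return node
        return self.nodes[0]

    @property
    def worker_nodes(self) -> list[GB10Node]:
        # Everything that is not the resolved head node.
        head = self.head_node
        return [node for node in self.nodes if node is not head]

    @property
    def is_cluster(self) -> bool:
        return len(self.nodes) > 1


cfg = ClusterSketch(
    nodes=[
        GB10Node("192.0.2.10"),
        GB10Node("192.0.2.11", NodeRole.HEAD),
    ]
)
print(cfg.head_node.ip, [n.ip for n in cfg.worker_nodes], cfg.is_cluster)
```

The sketch assumes at least one node is configured; with an empty list, `head_node` would raise an IndexError, which is consistent with the loader rejecting a config that defines no nodes.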
"""Backend implementations. k3s-only: vLLM is the single backend."""
from .base import Backend, BackendType, detect_platform
from .vllm import VllmBackend

__all__ = [
    "Backend",
    "BackendType",
    "VllmBackend",
    "detect_platform",
    "get_backend",
]


def get_backend(
    backend_type: BackendType | str | None = None,
    active_state: dict[str, object] | None = None,
) -> Backend:
    """Return the vLLM-in-k3s backend.

    The stack is k3s-only, so vLLM is the sole backend. ``backend_type`` is
    accepted for call-site compatibility but only ``vllm`` is valid.

    Args:
        backend_type: Must be ``vllm`` / ``BackendType.VLLM`` / None.
        active_state: Unused; kept for call-site compatibility.
    """
    if isinstance(backend_type, str):
        backend_type = BackendType(backend_type)
    if backend_type not in (None, BackendType.VLLM):
        raise ValueError(f"unsupported backend {backend_type!r}: the stack is k3s vLLM-only")
    return VllmBackend(active_state=active_state)

"""vLLM backend: a thin client for the vLLM pod running in k3s.
The stack is k3s-only. vLLM runs as the ``autosre-vllm-local`` Deployment and
is reached at the pinned ClusterIP (``config.get_vllm_url()``); Claude Code talks
to the ``autosre-proxy`` pod (``config.get_proxy_url()``), which translates the
Anthropic Messages API to vLLM's OpenAI Chat Completions. This class only
resolves URLs, checks health, and discovers the proxy's model label - it does
not start or stop anything (deploy with ``autosre k3s up``, scale with
``autosre start`` / ``autosre stop``).
"""
from __future__ import annotations

import logging
import os
from typing import Any, ClassVar

import httpx

from autosre.config import get_codex_url, get_proxy_url, get_vllm_url

from . import vllm_serve
from .base import Backend

logger = logging.getLogger(__name__)


class VllmBackend(Backend):
    """vLLM-in-k3s client (URL + health + model-label discovery)."""

    name: ClassVar[str] = "vllm"
    description: ClassVar[str] = "vLLM on GB10 (k3s pod, NVFP4)"
    api_port: ClassVar[int] = 8010
    proxy_port: ClassVar[int] = 8011  # Anthropic API proxy port (for Claude Code)
    codex_proxy_port: ClassVar[int] = 8012  # OpenAI Responses-API proxy port (for Codex CLI)

    # Single canonical recipe served by the k3s pod (see helm/autosre/values.yaml,
    # the runtime source of truth). The multimodal recipe is selected by editing
    # helm values + `autosre k3s up`, not a live model swap.
    models: ClassVar[dict[str, str]] = {

"""Base backend interface. k3s-only: vLLM is the single backend."""
import os
from abc import ABC, abstractmethod
from enum import Enum
from pathlib import Path
from typing import ClassVar


class BackendType(Enum):
    """Available backend types. The stack is k3s vLLM-only."""

    VLLM = "vllm"


def detect_platform() -> BackendType:
    """The only backend is vLLM-in-k3s."""
    return BackendType.VLLM


class Backend(ABC):
    """Abstract base class for inference backends."""

    name: ClassVar[str] = "base"
    description: ClassVar[str] = "Base backend"

    # Default API port
    api_port: ClassVar[int] = 8080

    # Default model configurations
    models: ClassVar[dict[str, str]] = {}
    default_model: ClassVar[str] = ""

    def __init__(self, active_state: dict[str, object] | None = None) -> None:
        """Initialize the backend.

        Args:
            active_state: Accepted for call-site compatibility; unused in k3s
                mode (state lives in the cluster, not on the host).
        """

"""Configuration for the vLLM backend on GB10 nodes."""
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path  # noqa: TC003 - used at runtime in classmethods

from autosre.infra.config import DATA_DIR, load_yaml, save_yaml
from autosre.infra.types import GB10Node, NodeRole


@dataclass
class VllmConfig:
    """Configuration for vLLM deployment on GB10 nodes.

    Persisted at ~/.local/share/autosre/vllm.yaml.
    Created by `autosre configure vllm --node <ip> [--node <ip>]`.
    """

    nodes: list[GB10Node]
    docker_image: str = "vllm/vllm-openai:latest"
    docker_image_fallback: str = "vllm/vllm-openai:nightly"
    hf_cache_dir: str = "/data/huggingface"
    nccl_socket_ifname: str = "enp1s0f0np0"

    @classmethod
    def default_path(cls) -> Path:
        """Default config file path."""
        return DATA_DIR / "vllm.yaml"

    @classmethod
    def load(cls, path: Path | None = None) -> VllmConfig:
        """Load config from YAML file.

        Args:
            path: Path to config file. None = default path.

        Raises:
            FileNotFoundError: If config file doesn't exist.
            ValueError: If config is invalid (no nodes defined).