
Services & Utilities

The autosre/services package encapsulates non-LLM auxiliary services that operate alongside the primary vLLM stack. While the backends/ subdirectory is reserved for components serving /v1/chat/completions, this package houses all other lifecycle-managed services, including browserless clients and future TLS proxies 1. These services are integrated into the autosre start and autosre stop lifecycle, ensuring they are managed consistently with the rest of the system.

The browserless.py module provides a thin client wrapper for the autosre-browser k3s Deployment, which runs Chromium 2. This service is designed for k3s-only environments where the MCP server (autosre/mcp_servers/browser.py) connects to the pod directly over CDP, bypassing host container lifecycle management. The browser pod is managed by Helm via autosre k3s up and is accessed through a pinned ClusterIP obtained via config.get_browser_url().

The BrowserlessService class handles WebSocket URL construction for CDP connections. It derives the scheme (ws or wss) from the ClusterIP configuration and optionally appends authentication tokens from the AUTOSRE_BROWSER_TOKEN environment variable. Recording is opt-in; passing record=True to cdp_ws_url appends record=true to the query parameters, causing the browserless pod to write finished .webm files to /recordings. These files are hostPath-mounted to a directory discoverable by the host MCP, with the local path configurable via AUTOSRE_BROWSER_RECORDING_DIR.
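The URL construction described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the free function shape, the `token` query parameter name, and the rule mapping `https`/`wss` base URLs to `wss` are all assumptions. Only `AUTOSRE_BROWSER_TOKEN`, `record=True`, and the `record=true` parameter come from the description.

```python
import os
from urllib.parse import urlencode, urlsplit, urlunsplit


def cdp_ws_url(browser_url: str, record: bool = False) -> str:
    """Build a CDP WebSocket URL from a configured base URL (illustrative)."""
    parts = urlsplit(browser_url)
    # Assumption: a TLS base URL (https/wss) maps to wss, anything else to ws.
    scheme = "wss" if parts.scheme in ("https", "wss") else "ws"

    query = {}
    token = os.environ.get("AUTOSRE_BROWSER_TOKEN")
    if token:
        # Assumption: the token travels as a `token` query parameter.
        query["token"] = token
    if record:
        # Recording is opt-in; the parameter is only added on request.
        query["record"] = "true"

    # urlunsplit omits the "?" when the encoded query string is empty.
    return urlunsplit((scheme, parts.netloc, parts.path or "/", urlencode(query), ""))
```

In this sketch the base URL would be whatever config.get_browser_url() returns; a caller that wants a recorded session passes record=True and leaves everything else unchanged.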

The service also includes a garbage collection mechanism, prune_recordings, which deletes .webm files older than a configurable TTL. The TTL defaults to 24 hours and can be overridden with the AUTOSRE_BROWSER_RECORDING_TTL environment variable.
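A TTL-based prune of this kind might look like the sketch below. The signature, the use of file modification time as the age, and the assumption that the environment variable holds a value in seconds are all hypothetical; the source only states the 24-hour default and the variable name.

```python
import os
import time
from pathlib import Path

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # 24 hours


def prune_recordings(recording_dir, ttl_seconds=None, now=None):
    """Delete .webm files older than the TTL and return the deleted paths."""
    if ttl_seconds is None:
        # Assumption: the environment variable expresses the TTL in seconds.
        ttl_seconds = float(
            os.environ.get("AUTOSRE_BROWSER_RECORDING_TTL", DEFAULT_TTL_SECONDS)
        )
    now = time.time() if now is None else now

    deleted = []
    for path in Path(recording_dir).glob("*.webm"):
        # Assumption: a file's age is measured from its last modification time.
        if now - path.stat().st_mtime > ttl_seconds:
            path.unlink()
            deleted.append(path)
    return deleted
```

Accepting `now` as a parameter keeps the function easy to test without touching the system clock.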


The bench.py module provides a harness for benchmarking vLLM models on local GPUs, measuring single-request throughput (TTFT and decode TPS), concurrent throughput, tool calling validation, and memory usage 3. Results are persisted as JSON files in ~/.local/share/autosre/benchmarks/.
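To make the single-request metrics concrete, the sketch below derives TTFT and decode TPS from token arrival timestamps. The function name and the exact definitions are assumptions for illustration: here TTFT is the delay until the first token, and decode TPS counts the tokens after the first over the time between the first and last token. The harness may define them differently.

```python
def summarize_stream(request_start, token_times):
    """Compute TTFT and decode throughput from token arrival timestamps.

    request_start and token_times are seconds on the same monotonic clock.
    """
    if not token_times:
        return {"ttft_s": None, "decode_tps": 0.0}

    ttft = token_times[0] - request_start
    # Assumption: the decode window starts at the first token, so the first
    # token itself is attributed to prefill rather than decode.
    decode_window = token_times[-1] - token_times[0]
    decode_tokens = len(token_times) - 1
    decode_tps = decode_tokens / decode_window if decode_window > 0 else 0.0
    return {"ttft_s": ttft, "decode_tps": decode_tps}
```

Separating the two numbers matters because a long prompt inflates TTFT without saying anything about steady-state generation speed.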

The harness employs a “shared-port” strategy to avoid interfering with live model serving. Before benchmarking, it probes port 8010 via _probe_live_model. If a model is already serving on that port, the harness checks whether it matches the requested spec. If it matches, the benchmark runs in “live” mode against the existing container; if it differs, the harness refuses to proceed so that it never benchmarks the wrong model. If port 8010 is idle, the harness spins up its own autosre-bench Docker container using the ghcr.io/bjk110/vllm-spark:turboquant image. This avoids an earlier failure mode in which the bench container could not bind the port already held by the live server, producing false “Ready in 1s + 0 tok/s” health-check passes.
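The three-way decision above can be sketched as a small pure function. Every name here (`BenchMode`, `ModelMismatchError`, `choose_bench_mode`) is hypothetical, and the sketch assumes the probe result has already been reduced to a model name, or `None` when port 8010 is idle. It does not reproduce _probe_live_model itself.

```python
from enum import Enum
from typing import Optional


class BenchMode(Enum):
    LIVE = "live"                    # reuse the model already serving
    OWN_CONTAINER = "own_container"  # port is idle, start a bench container


class ModelMismatchError(RuntimeError):
    """Raised when the port is held by a model other than the requested one."""


def choose_bench_mode(live_model: Optional[str], requested_model: str) -> BenchMode:
    """Decide how to benchmark, given the probe result for the shared port."""
    if live_model is None:
        # Nothing is serving, so the harness can bind the port itself.
        return BenchMode.OWN_CONTAINER
    if live_model == requested_model:
        # The right model is already up; benchmark it in place.
        return BenchMode.LIVE
    # Refuse rather than silently measure a different model.
    raise ModelMismatchError(
        f"port is serving {live_model!r}, but {requested_model!r} was requested"
    )
```

Keeping the decision separate from the probe makes the refusal path easy to unit test without a running server.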

In k3s mode, the harness cannot start a host container, because the vLLM service is exposed through a ClusterIP and a container on the host cannot serve requests sent to that address. In this case, the harness refuses to run unless the autosre-vllm-local pod is scaled up and serving on the service endpoint.

For reasoning models like Qwen3.6 and Nemotron-nano-v3, the harness counts tokens from content, reasoning, and reasoning_content deltas to accurately measure throughput, as these models stream tokens on the reasoning channel. The benchmark suite includes predefined models such as Nemotron-Nano-30B NVFP4 and Qwen3.6-35B-A3B NVFP4, each with specific vLLM arguments for load formats and caching.
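A rough sketch of counting across all three channels is shown below. It assumes OpenAI-style streaming chunks already parsed into dictionaries, and it approximates one non-empty delta as one token, which is a simplification; the harness's actual counting method is not described in this section. The function name is hypothetical.

```python
# Delta fields that may carry generated text, per the description above.
TOKEN_FIELDS = ("content", "reasoning", "reasoning_content")


def count_streamed_deltas(chunks):
    """Count stream deltas that carry text on any content or reasoning field."""
    count = 0
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # Assumption: each non-empty delta corresponds to roughly one token.
            if any(delta.get(field) for field in TOKEN_FIELDS):
                count += 1
    return count
```

Counting only the content field would report near-zero throughput for a model that spends most of its output on the reasoning channel, which is the case this logic guards against.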
