Services & Utilities
The autosre/services package encapsulates non-LLM auxiliary services that operate alongside the primary vLLM stack. While the backends/ subdirectory is reserved for components serving /v1/chat/completions, this package houses all other lifecycle-managed services, including browserless clients and future TLS proxies 1. These services are integrated into the autosre start and autosre stop lifecycle, ensuring they are managed consistently with the rest of the system.
Browserless Client
The browserless.py module provides a thin client wrapper for the autosre-browser k3s Deployment, which runs Chromium 2. This service is designed for k3s-only environments where the MCP server (autosre/mcp_servers/browser.py) connects to the pod directly over CDP, bypassing host container lifecycle management. The browser pod is managed by Helm via autosre k3s up and is accessed through a pinned ClusterIP obtained via config.get_browser_url().
The BrowserlessService class handles WebSocket URL construction for CDP connections. It derives the scheme (ws or wss) from the ClusterIP configuration and optionally appends authentication tokens from the AUTOSRE_BROWSER_TOKEN environment variable. Recording is opt-in; passing record=True to cdp_ws_url appends record=true to the query parameters, causing the browserless pod to write finished .webm files to /recordings. These files are hostPath-mounted to a directory discoverable by the host MCP, with the local path configurable via AUTOSRE_BROWSER_RECORDING_DIR.
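The URL construction described above can be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function name build_cdp_ws_url, the token query parameter name, and the mapping from http/https to ws/wss are assumptions made for the example.

```python
from __future__ import annotations

from urllib.parse import urlencode, urlsplit, urlunsplit


def build_cdp_ws_url(base_url: str, record: bool = False, token: str | None = None) -> str:
    """Hypothetical helper: turn a browser base URL into a CDP WebSocket URL."""
    parts = urlsplit(base_url)
    # Assumption: secure base URLs (https/wss) map to wss, everything else to ws.
    scheme = "wss" if parts.scheme in ("https", "wss") else "ws"
    query: dict[str, str] = {}
    if token:
        # Assumption: the token travels as a "token" query parameter.
        query["token"] = token
    if record:
        # From the text above: record=True appends record=true to the query.
        query["record"] = "true"
    return urlunsplit((scheme, parts.netloc, parts.path or "/", urlencode(query), ""))
```

A caller would typically pass os.environ.get("AUTOSRE_BROWSER_TOKEN") as the token. Keeping the token an explicit argument makes the helper easy to test without touching the environment.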
The service also includes a garbage collection mechanism, prune_recordings, which deletes .webm files older than a configurable TTL. The TTL defaults to 24 hours and can be overridden with the AUTOSRE_BROWSER_RECORDING_TTL environment variable.
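A TTL sweep of this kind might look like the sketch below. The function name prune_old_recordings, its signature, and the use of seconds as the TTL unit are assumptions for illustration; the real prune_recordings may differ in all three.

```python
from __future__ import annotations

import time
from pathlib import Path


def prune_old_recordings(record_dir: Path, ttl_seconds: int, now: float | None = None) -> list[Path]:
    """Delete .webm files whose mtime is older than ttl_seconds; return the removed paths."""
    now = time.time() if now is None else now
    removed: list[Path] = []
    if not record_dir.is_dir():
        return removed
    for path in sorted(record_dir.glob("*.webm")):
        try:
            if now - path.stat().st_mtime > ttl_seconds:
                path.unlink()
                removed.append(path)
        except OSError:
            # The file vanished or cannot be removed; skip it rather than abort the sweep.
            continue
    return removed
```

Only files matching *.webm directly inside the directory are considered, so unrelated files that share the mount are left alone.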
Model Benchmarking Harness
The bench.py module provides a harness for benchmarking vLLM models on local GPUs, measuring single-request throughput (TTFT and decode TPS), concurrent throughput, tool calling validation, and memory usage 3. Results are persisted as JSON files in ~/.local/share/autosre/benchmarks/.
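For readers unfamiliar with the two single-request metrics, one common way to derive them from per-token arrival timestamps is sketched below. The function name summarize_stream and the exact definition of decode TPS (tokens after the first, divided by the time between first and last token) are assumptions; the harness may define them differently.

```python
from __future__ import annotations


def summarize_stream(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Return TTFT (seconds) and decode TPS from per-token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were received")
    # Time to first token: delay between sending the request and the first token.
    ttft = token_times[0] - request_start
    decode_window = token_times[-1] - token_times[0]
    # Tokens after the first one, divided by the time spent producing them.
    decode_tps = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return {"ttft_s": ttft, "decode_tps": decode_tps}
```

Separating the first token from the rest keeps prompt processing time out of the decode rate.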
The harness employs a “shared-port” strategy to avoid interfering with live model serving. Before touching any container, it probes port 8010 via _probe_live_model. If a model is already serving on that port, the harness checks whether it matches the requested spec. If it matches, the benchmark runs in “live” mode against the existing container, which is never stopped or restarted; if it differs, the harness refuses with an actionable message rather than silently benchmarking the wrong model. Only when port 8010 is idle does the harness spin up its own autosre-bench Docker container using the ghcr.io/bjk110/vllm-spark:turboquant image. This fixes a historical failure reported as “Ready in 1s + 0 tok/s”: the bench container could not bind the port held by the live server, so the health check passed instantly against the live server instead. It also respects the rule that the live autosre-vllm-local container is never removed out from under a running session.
In k3s mode, the harness cannot start a host container because the vLLM service is backed by a ClusterIP that a host container cannot answer. In this scenario, the harness refuses to run unless the autosre-vllm-local pod is scaled up and serving on the service endpoint.
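The decision rules from the two paragraphs above can be condensed into a small pure function. This is a sketch under assumptions: the names BenchMode and decide_bench_mode are hypothetical, spec matching is reduced to string equality, and the probe result is passed in rather than fetched from port 8010.

```python
from __future__ import annotations

from enum import Enum


class BenchMode(Enum):
    LIVE = "live"          # benchmark the already-serving model in place
    OWN_CONTAINER = "own"  # port idle on the host: start a dedicated bench container
    REFUSE = "refuse"      # wrong model is serving, or k3s with nothing serving


def decide_bench_mode(live_model: str | None, requested_model: str, k3s: bool = False) -> BenchMode:
    """Pick a mode from the probe result (None means nothing is serving)."""
    if live_model is not None:
        # Something is serving: only proceed if it is the model that was requested.
        return BenchMode.LIVE if live_model == requested_model else BenchMode.REFUSE
    # Nothing is serving. A host container cannot stand in for the k3s service.
    return BenchMode.REFUSE if k3s else BenchMode.OWN_CONTAINER
```

Keeping the decision separate from the probe and from container management makes each branch straightforward to test in isolation.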
For reasoning models like Qwen3.6 and Nemotron-nano-v3, the harness measures throughput by counting every generated streaming delta: content, reasoning, and reasoning_content. These models stream their tokens on the reasoning channel, so counting only content (the pre-2026-05 behavior) recorded 0 tok/s for them despite healthy generation. The benchmark suite includes predefined models such as Nemotron-Nano-30B NVFP4 and Qwen3.6-35B-A3B NVFP4, each with specific vLLM arguments for load formats and caching.
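The counting rule can be illustrated with a short sketch over OpenAI-style streaming chunks. The function name count_generated_deltas is hypothetical, and treating each non-empty delta as one generated unit is a simplifying assumption; the harness's own accounting may be finer-grained.

```python
from __future__ import annotations

from typing import Any, Iterable

# The three delta fields named in the text above.
TOKEN_FIELDS = ("content", "reasoning", "reasoning_content")


def count_generated_deltas(chunks: Iterable[dict[str, Any]]) -> int:
    """Count streamed deltas that carry generated text on any of the three channels."""
    count = 0
    for chunk in chunks:
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            # Empty strings and missing fields do not count as generated output.
            if any(delta.get(field) for field in TOKEN_FIELDS):
                count += 1
    return count
```

With a stream that carries only reasoning deltas, a content-only counter would report zero, which is the failure mode the paragraph above describes.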
"""Non-LLM auxiliary services managed alongside autosre's vLLM stack.
`backends/` is reserved for components that serve `/v1/chat/completions`.
Everything else that participates in the `autosre start` / `autosre stop`
lifecycle (browserless, future TLS proxies, etc.) lives here.
"""
"""browserless/chromium client for autosre's Browser Run MCP (k3s-only).
browserless runs as the ``autosre-browser`` k3s Deployment at the pinned
ClusterIP (``config.get_browser_url()``). There is no host container lifecycle:
the MCP server (``autosre/mcp_servers/browser.py``) just connects to the pod
over CDP. Recording is opt-in per ``browser_session_open(record=True)``; the pod
writes finished ``.webm`` files to ``RECORDING_PATH=/recordings``, which the
chart hostPath-mounts to ``record_dir`` so the host MCP can discover them.
"""
from __future__ import annotations
import os
import stat
import time
from typing import TYPE_CHECKING, ClassVar
from urllib.parse import urlsplit
from autosre import paths
from autosre.config import get_browser_url
if TYPE_CHECKING:
    from pathlib import Path
def _env_int(key: str, default: int) -> int:
    raw = os.environ.get(key)
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError:
        return default
class BrowserlessService:
    """Thin client for the ``autosre-browser`` k3s pod.
    No docker lifecycle: the pod is managed by helm (``autosre k3s up``). The
    MCP server uses ``cdp_ws_url`` to connect over CDP, ``record_dir`` to find
"""Model benchmarking for autosre.
Benchmarks vLLM models on the local GPU:
- Single-request throughput (TTFT + decode TPS)
- Concurrent throughput (aggregate TPS at N parallel)
- Tool calling validation
- Memory usage
Results are saved to ~/.local/share/autosre/benchmarks/.
SHARED-PORT BEHAVIOR: `autosre bench` probes :8010 before touching any
container (`_probe_live_model`). If a model is already serving there:
- it matches the requested spec → benchmark it in place (mode="live"),
never stop/start a container;
- it differs → refuse with an actionable message
rather than silently benchmarking the wrong model.
Only when :8010 is idle does bench spin up its own `autosre-bench`
container. This both fixes the historical "Ready in 1s + 0 tok/s" failure
(where the bench container couldn't bind the port the live server held, so
health passed instantly against the live server) and respects the rule
that the live `autosre-vllm-local` is never `docker rm`'d out from under
a running session.
REASONING MODELS: throughput is measured by counting every generated
streaming delta - `content`, `reasoning`, and `reasoning_content`. Models
like Qwen3.6 / Nemotron-nano-v3 emit their tokens on the `reasoning`
channel, so counting only `content` (the pre-2026-05 behavior) recorded
0 tok/s for them despite healthy generation.
"""
from __future__ import annotations
import json
import os
import subprocess
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any