Services & Utilities

The autosre/services package encapsulates non-LLM auxiliary services that operate alongside the primary vLLM stack. While the backends/ subdirectory is reserved for components serving /v1/chat/completions, this package houses all other lifecycle-managed services, including browserless clients and future TLS proxies ¹. These services are integrated into the autosre start and autosre stop lifecycle, ensuring they are managed consistently with the rest of the system.

Browserless Client

The browserless.py module provides a thin client wrapper for the autosre-browser k3s Deployment, which runs Chromium ². This service is designed for k3s-only environments where the MCP server (autosre/mcp_servers/browser.py) connects to the pod directly over CDP, bypassing host container lifecycle management. The browser pod is managed by Helm via autosre k3s up and is accessed through a pinned ClusterIP obtained via config.get_browser_url().

The BrowserlessService class handles WebSocket URL construction for CDP connections. It derives the scheme (ws or wss) from the ClusterIP configuration and optionally appends authentication tokens from the AUTOSRE_BROWSER_TOKEN environment variable. Recording is opt-in; passing record=True to cdp_ws_url appends record=true to the query parameters, causing the browserless pod to write finished .webm files to /recordings. These files are hostPath-mounted to a directory discoverable by the host MCP, with the local path configurable via AUTOSRE_BROWSER_RECORDING_DIR.

The service also includes a garbage collection mechanism via prune_recordings, which deletes .webm files older than a configurable TTL (defaulting to 24 hours or the AUTOSRE_BROWSER_RECORDING_TTL environment variable).

Model Benchmarking Harness

The bench.py module provides a harness for benchmarking vLLM models on local GPUs, measuring single-request throughput (TTFT and decode TPS), concurrent throughput, tool calling validation, and memory usage ³. Results are persisted as JSON files in ~/.local/share/autosre/benchmarks/.

The harness employs a “shared-port” strategy to avoid interfering with live model serving. Before benchmarking, it probes port 8010 via _probe_live_model. If a model is already serving on that port, the harness checks if it matches the requested spec. If it matches, the benchmark runs in “live” mode against the existing container; if it differs, the harness refuses to proceed to prevent benchmarking the wrong model. If port 8010 is idle, the harness spins up its own autosre-bench Docker container using the ghcr.io/bjk110/vllm-spark:turboquant image. This approach prevents historical issues where the bench container could not bind the port held by the live server, which previously caused false “Ready in 1s + 0 tok/s” health check passes.

In k3s mode, the harness cannot start a host container because the vLLM service is backed by a ClusterIP that a host container cannot answer. In this scenario, the harness refuses to run unless the autosre-vllm-local pod is scaled up and serving on the service endpoint.

For reasoning models like Qwen3.6 and Nemotron-nano-v3, the harness counts tokens from content, reasoning, and reasoning_content deltas to accurately measure throughput, as these models stream tokens on the reasoning channel. The benchmark suite includes predefined models such as Nemotron-Nano-30B NVFP4 and Qwen3.6-35B-A3B NVFP4, each with specific vLLM arguments for load formats and caching.

autosre/services/__init__.py L1-7

"""Non-LLM auxiliary services managed alongside autosre's vLLM stack.

`backends/` is reserved for components that serve `/v1/chat/completions`.
Everything else that participates in the `autosre start` / `autosre stop`
lifecycle (browserless, future TLS proxies, etc.) lives here.
"""

autosre/services/browserless.py L1-107 (showing 40 of 107)

"""browserless/chromium client for autosre's Browser Run MCP (k3s-only).

browserless runs as the ``autosre-browser`` k3s Deployment at the pinned
ClusterIP (``config.get_browser_url()``). There is no host container lifecycle:
the MCP server (``autosre/mcp_servers/browser.py``) just connects to the pod
over CDP. Recording is opt-in per ``browser_session_open(record=True)``; the pod
writes finished ``.webm`` files to ``RECORDING_PATH=/recordings``, which the
chart hostPath-mounts to ``record_dir`` so the host MCP can discover them.
"""

from __future__ import annotations

import os
import stat
import time
from typing import TYPE_CHECKING, ClassVar
from urllib.parse import urlsplit

from autosre import paths
from autosre.config import get_browser_url

if TYPE_CHECKING:
    from pathlib import Path


def _env_int(key: str, default: int) -> int:
    raw = os.environ.get(key)
    if not raw:
        return default
    try:
        return int(raw)
    except ValueError:
        return default


class BrowserlessService:
    """Thin client for the ``autosre-browser`` k3s pod.

    No docker lifecycle: the pod is managed by helm (``autosre k3s up``). The
    MCP server uses ``cdp_ws_url`` to connect over CDP, ``record_dir`` to find

autosre/bench.py L1-120 (showing 40 of 120)

"""Model benchmarking for autosre.

Benchmarks vLLM models on the local GPU:
  - Single-request throughput (TTFT + decode TPS)
  - Concurrent throughput (aggregate TPS at N parallel)
  - Tool calling validation
  - Memory usage

Results are saved to ~/.local/share/autosre/benchmarks/.

SHARED-PORT BEHAVIOR: `autosre bench` probes :8010 before touching any
container (`_probe_live_model`). If a model is already serving there:
  - it matches the requested spec  → benchmark it in place (mode="live"),
    never stop/start a container;
  - it differs                     → refuse with an actionable message
    rather than silently benchmarking the wrong model.
Only when :8010 is idle does bench spin up its own `autosre-bench`
container. This both fixes the historical "Ready in 1s + 0 tok/s" failure
(where the bench container couldn't bind the port the live server held, so
health passed instantly against the live server) and respects the rule
that the live `autosre-vllm-local` is never `docker rm`'d out from under
a running session.

REASONING MODELS: throughput is measured by counting every generated
streaming delta - `content`, `reasoning`, and `reasoning_content`. Models
like Qwen3.6 / Nemotron-nano-v3 emit their tokens on the `reasoning`
channel, so counting only `content` (the pre-2026-05 behavior) recorded
0 tok/s for them despite healthy generation.
"""

from __future__ import annotations

import json
import os
import subprocess
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Any