Skip to content

Benchmarks & Spikes

The autosre repository includes a specialized benchmarking suite within benchmarks/spike_nvfp4_mtp designed to validate the fidelity of quantized models (specifically NVFP4) against high-precision references. This suite separates the concerns of infrastructure management and quality evaluation to ensure rigorous, reproducible results. It employs a dynamic staging deployment manager to handle single-tenant vLLM instances without interfering with production selectors, and a first-token distribution quality benchmark that prioritizes logit-cosine similarity over simple token agreement to detect distributional drift.

The staging infrastructure is managed by benchmarks/spike_nvfp4_mtp/staging.py, which provisions a throwaway, single-tenant vLLM deployment. This deployment mirrors the production helm/autosre/templates/deployment-vllm.yaml configuration - including runtimeClass nvidia, HF-cache hostPath, mandatory-offline environment variables, and /dev/shm - but uses a distinct name and label (app=spike-vllm) to avoid collisions with the production autosre-vllm-local selector, podAntiAffinity rules, or the pinned ClusterIP 10.96.0.30 1.

Crucially, the staging service uses a dynamically assigned ClusterIP rather than a port-forward. This ensures that the transport mechanism matches the production host-to-ClusterIP path, preventing the API-server SPDY hop from distorting Time-To-First-Token (TTFT) metrics. The system assumes the GPU has been vacated via autosre.dedicated.up_spike before applying a challenger.

diagram

The StagingSpec dataclass defines the challenger configuration, including the image reference, model ID, and vLLM serve arguments. The apply function generates a Kubernetes manifest and applies it, while cluster_ip and base_url retrieve the dynamically assigned network endpoint. The wait_ready function polls the /health endpoint on the dynamic ClusterIP until a 200 status is received or a timeout occurs. The teardown function deletes the deployment and service, waiting for the pod to fully terminate to ensure unified memory is released before the next challenger loads.

Additionally, the module provides utilities to parse vLLM boot logs. The parse_memory function extracts metrics such as model weights size, available KV cache, and maximum concurrency using regex patterns. The log_has_fp4_nan function checks for specific CUTLASS or flashinfer FP4 errors, such as flashinfer_mm_fp4 or NaN values, which indicate quantization failures.

The quality benchmark is implemented in benchmarks/spike_nvfp4_mtp/spike_quality.py. It addresses a critical rigor lesson from the FFAI-vs-vLLM benchmark: a quantization method might preserve the argmax token (top-1 agreement) while destroying the overall probability distribution, as evidenced by a low cosine similarity of 0.44 in a config that still passed argmax checks 2. Therefore, this benchmark gates on top-k logprob cosine similarity rather than relying solely on token agreement or BLEU scores.

The benchmark uses a fixed, deterministic set of prompts (FIRST_TOKEN_PROMPTS) spanning code generation, English-Japanese translation, reasoning, and prose. To avoid the “divergent-generated-history problem,” the evaluation compares only the distribution of the first generated token for each fixed prompt.

diagram

The capture_first_token function sends each prompt to a server with max_tokens=1, temperature=0.0, and top_logprobs=k (default 20), capturing the top-k logprobs for the first token. The compare_first_token function takes reference and challenger rows, aligns the top-k logprobs by token into a shared vocabulary (flooding missing entries with -30.0), and computes the cosine similarity between the two vectors. The output includes a summary with the mean and minimum cosine similarity, as well as the argmax agreement rate, and a per-prompt breakdown. This separation of generation and scoring allows the cosine computation to be performed off-GPU after the runs are captured.