Benchmarks & Spikes

The autosre repository includes a benchmarking suite in benchmarks/spike_nvfp4_mtp that validates the fidelity of quantized models (specifically NVFP4) against a higher-precision FP8 reference. The suite keeps infrastructure management separate from quality evaluation so that results stay reproducible. It has two parts: a staging deployment manager that runs single-tenant vLLM instances on a dynamically assigned ClusterIP without interfering with production selectors, and a first-token distribution quality benchmark that gates on logit-cosine similarity rather than simple token agreement in order to detect distributional drift.

Staging Deployment Manager

The staging infrastructure is managed by benchmarks/spike_nvfp4_mtp/staging.py, which provisions a throwaway, single-tenant vLLM deployment. This deployment mirrors the production helm/autosre/templates/deployment-vllm.yaml configuration - including runtimeClass nvidia, HF-cache hostPath, mandatory-offline environment variables, and /dev/shm - but uses a distinct name and label (app=spike-vllm) to avoid collisions with the production autosre-vllm-local selector, podAntiAffinity rules, or the pinned ClusterIP 10.96.0.30 ¹.

Crucially, the staging service uses a dynamically assigned ClusterIP rather than a port-forward. This ensures that the transport mechanism matches the production host-to-ClusterIP path, preventing the API-server SPDY hop from distorting Time-To-First-Token (TTFT) metrics. The system assumes the GPU has been vacated via autosre.dedicated.up_spike before applying a challenger.
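The dynamic-ClusterIP approach can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names service_manifest and base_url_from_service are hypothetical, and the sketch assumes the Service JSON has already been fetched (for example with kubectl get svc -o json). The only Kubernetes facts it relies on are that a ClusterIP Service with no spec.clusterIP set receives an address from the API server, and that the assigned address appears under spec.clusterIP.

```python
import json


def service_manifest(name: str, namespace: str, port: int) -> dict:
    """Build a ClusterIP Service manifest that leaves spec.clusterIP unset.

    Hypothetical helper: the real staging.py may build its manifest differently.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name, "namespace": namespace, "labels": {"app": name}},
        "spec": {
            "type": "ClusterIP",
            # No "clusterIP" key here, so the API server assigns an address
            # instead of reusing a pinned one.
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port}],
        },
    }


def base_url_from_service(service_json: str, port: int) -> str:
    """Derive an http base URL from the JSON of an existing Service object."""
    svc = json.loads(service_json)
    ip = svc.get("spec", {}).get("clusterIP", "")
    if not ip or ip == "None":  # "None" is the headless-service marker
        raise RuntimeError("Service has no ClusterIP assigned")
    return f"http://{ip}:{port}"
```

Because the manifest uses its own name and app label, its selector cannot match production pods, and because no address is pinned it cannot collide with the production ClusterIP.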

The StagingSpec dataclass defines the challenger configuration, including the image reference, model ID, and vLLM serve arguments. The apply function generates a Kubernetes manifest and applies it, while cluster_ip and base_url retrieve the dynamically assigned network endpoint. The wait_ready function polls the /health endpoint on the dynamic ClusterIP until a 200 status is received or a timeout occurs. The teardown function deletes the deployment and service, waiting for the pod to fully terminate to ensure unified memory is released before the next challenger loads.
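The readiness poll described above might look like the sketch below. It is an assumption-laden illustration rather than the module's implementation: the default timeout and interval, the http_status helper, and the injectable probe, sleep, and clock parameters are all hypothetical, added so the loop can be exercised without a live server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503 while loading).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, socket timeout, and similar.
        return 0


def wait_ready(
    base_url: str,
    timeout_s: float = 900.0,   # assumed default, not taken from the source
    interval_s: float = 5.0,    # assumed default, not taken from the source
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll {base_url}/health until it returns 200 or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe(f"{base_url}/health") == 200:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Using a monotonic clock keeps the deadline immune to wall-clock adjustments, and treating connection errors as "not ready yet" lets the loop start before the pod is listening.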

Additionally, the module provides utilities to parse vLLM boot logs. The parse_memory function extracts metrics such as model weights size, available KV cache, and maximum concurrency using regex patterns. The log_has_fp4_nan function checks for specific CUTLASS or flashinfer FP4 errors, such as flashinfer_mm_fp4 or NaN values, which indicate quantization failures.
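A regex-based log parser of this kind could be sketched as below. The log line shapes, the dictionary keys, and the failure pattern are assumptions made for illustration; they are not taken from staging.py and should be checked against real vLLM boot logs before being relied on.

```python
import re
from typing import Dict

# ASSUMED line shapes for illustration only; real vLLM log wording may differ
# between versions.
_PATTERNS = {
    "weights_gib": re.compile(r"Model loading took ([\d.]+) GiB"),
    "kv_cache_tokens": re.compile(r"GPU KV cache size: ([\d,]+) tokens"),
    "max_concurrency": re.compile(
        r"Maximum concurrency for [\d,]+ tokens per request: ([\d.]+)x"
    ),
}

# ASSUMED failure markers: the kernel name mentioned above, or a standalone "NaN".
_FP4_FAILURE = re.compile(r"flashinfer_mm_fp4|\bnan\b", re.IGNORECASE)


def parse_memory(log: str) -> Dict[str, float]:
    """Extract whichever memory metrics are present; absent metrics are omitted."""
    metrics: Dict[str, float] = {}
    for key, pattern in _PATTERNS.items():
        match = pattern.search(log)
        if match:
            # Strip thousands separators before converting.
            metrics[key] = float(match.group(1).replace(",", ""))
    return metrics


def log_has_fp4_nan(log: str) -> bool:
    """Return True if the log contains an assumed FP4 failure marker."""
    return bool(_FP4_FAILURE.search(log))
```

Returning only the metrics that matched, rather than failing on a missing line, keeps the parser usable when a challenger crashes partway through boot.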

First-Token Distribution Quality

The quality benchmark is implemented in benchmarks/spike_nvfp4_mtp/spike_quality.py. It addresses a critical rigor lesson from the FFAI-vs-vLLM benchmark: a quantization method might preserve the argmax token (top-1 agreement) while destroying the overall probability distribution, as evidenced by a low cosine similarity of 0.44 in a config that still passed argmax checks ². Therefore, this benchmark gates on top-k logprob cosine similarity rather than relying solely on token agreement or BLEU scores.

The benchmark uses a fixed, deterministic set of prompts (FIRST_TOKEN_PROMPTS) spanning code generation, translation between English and Japanese in both directions, reasoning, and prose. To avoid the “divergent-generated-history problem,” in which two models' continuations stop being comparable once their outputs differ, the evaluation compares only the distribution of the first generated token for each fixed prompt.

The capture_first_token function sends each prompt to a server with max_tokens=1, temperature=0.0, and top_logprobs=k (default 20), capturing the top-k logprobs for the first token. The compare_first_token function takes reference and challenger rows, aligns the top-k logprobs by token into a shared vocabulary (flooring missing entries at -30.0), and computes the cosine similarity between the two vectors. The output includes a summary with the mean and minimum cosine similarity and the argmax agreement rate, plus a per-prompt breakdown. Because generation and scoring are separated, the cosine can be computed off-GPU after the runs are captured.
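The alignment and scoring step can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of spike_quality.py: each captured row is assumed to be a plain token-to-logprob mapping, and the function names cosine_first_token, argmax_token, and summarize are hypothetical. The -30.0 floor and the summary fields follow the description above.

```python
import math
from typing import Dict, List, Tuple

FLOOR = -30.0  # logprob assigned to tokens missing from one side's top-k


def cosine_first_token(
    ref: Dict[str, float], chal: Dict[str, float], floor: float = FLOOR
) -> float:
    """Cosine similarity of two top-k logprob maps over their shared vocabulary."""
    vocab = sorted(set(ref) | set(chal))
    a = [ref.get(tok, floor) for tok in vocab]
    b = [chal.get(tok, floor) for tok in vocab]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def argmax_token(top: Dict[str, float]) -> str:
    """Token with the highest logprob."""
    return max(top, key=top.get)


def summarize(pairs: List[Tuple[Dict[str, float], Dict[str, float]]]) -> dict:
    """Mean/min cosine and argmax agreement over (reference, challenger) pairs."""
    if not pairs:
        raise ValueError("no prompt pairs to compare")
    cosines = [cosine_first_token(r, c) for r, c in pairs]
    agree = [argmax_token(r) == argmax_token(c) for r, c in pairs]
    return {
        "mean_cosine": sum(cosines) / len(cosines),
        "min_cosine": min(cosines),
        "argmax_agreement": sum(agree) / len(agree),
    }
```

Reporting the minimum alongside the mean matters because a single badly drifted prompt can hide behind a healthy average, and keeping argmax agreement as a separate field makes it visible when the top token survives while the rest of the distribution does not.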

benchmarks/spike_nvfp4_mtp/staging.py L1-120 (showing 40 of 120)

"""Throwaway single-tenant vLLM staging Deployment for the NVFP4/MTP spike.

Mirrors ``helm/autosre/templates/deployment-vllm.yaml`` (runtimeClass nvidia, HF-cache
hostPath, mandatory-offline env, /dev/shm) but with a DISTINCT name/label so it never
collides with the production ``autosre-vllm-local`` selector, its podAntiAffinity, or the
pinned ClusterIP ``10.96.0.30``. The challenger is reached over a DYNAMICALLY-assigned
ClusterIP (the exact host->ClusterIP transport production uses), never a port-forward,
whose API-server SPDY hop would distort TTFT.

Privileged cluster I/O goes through ``autosre.k3s_cmd`` (same seam the dedicated-mode
machinery uses). Single-tenant only: this assumes the GPU was already vacated via
``autosre.dedicated.up_spike`` before a challenger is applied.
"""

from __future__ import annotations

import json
import re
import tempfile
import time
import urllib.request
from dataclasses import dataclass, field
from pathlib import Path

from autosre import k3s_cmd

NS = "autosre"
NAME = "spike-vllm"
PORT = 8015
HF_HOSTPATH = "/data/huggingface"
_SELECTOR = f"app={NAME}"


def _kubectl(*args: str, timeout: int = 120) -> tuple[int, str]:
    r = k3s_cmd.kubectl(*args, timeout=timeout)
    return r.returncode, (r.stdout + r.stderr).strip()


@dataclass
class StagingSpec:

benchmarks/spike_nvfp4_mtp/spike_quality.py L1-120 (showing 40 of 120)

"""First-token logit-cosine + argmax agreement between a challenger and the FP8 reference.

The FFAI-vs-vLLM benchmark's key rigor lesson: a quantization can preserve the argmax
token yet destroy the distribution (they measured cosine 0.44 on a config that still
passed argmax). So we gate on the top-k logprob COSINE, not just token agreement or BLEU.

Well-defined form (avoids the divergent-generated-history problem): compare only the
FIRST generated token's distribution for a fixed prompt set. ``top_logprobs`` from each
server are aligned by token into a shared vocab (missing entries floored) before cosine.
Generation and scoring are separated: capture runs while each vLLM is up; the cosine is
computed later, off the GPU.
"""

from __future__ import annotations

import json
import math
import urllib.request
from typing import Any

# Fixed deterministic prompt set spanning code, EN<->JA translation, reasoning, prose.
FIRST_TOKEN_PROMPTS = [
    "Write a Python function that returns the nth Fibonacci number.",
    "Translate to Japanese: The quarterly revenue exceeded expectations.",
    "翻訳してください（英語へ）: 会議の議事録を送ってください。",
    "What is the capital of France?",
    "Explain what a hash map is in one sentence.",
    "Continue the sequence: 2, 4, 8, 16,",
    "Fix the bug: def add(a, b): return a - b",
    "Summarize in one line: The GB10 is a Grace-Blackwell unified-memory system.",
    "What does the SQL keyword JOIN do?",
    "Name a primary color.",
    "Convert 100 Fahrenheit to Celsius (just the number).",
    "Write a haiku about autumn.",
    "What is 17 multiplied by 23?",
    "Give the git command to create a new branch.",
    "Translate to English: 明日の天気は晴れです。",
    "Complete: The mitochondria is the powerhouse of the",
]