Performance & Benchmarking

The autosre performance benchmarking suite is a concurrent-workload regression harness that measures the stability and performance of the shared vLLM server. It drives translation (priority=-10) and coding (priority=10) workloads simultaneously against the server endpoint at :8010 and captures per-workload TTFT/TPS percentiles and vLLM scheduler counters. These measurements are compared against committed, named baselines stored under repos/autosre/benchmarks/baselines/ to detect regressions or improvements in system behavior.

Benchmarking Harness and Workloads

The core execution logic is exposed through the run function in autosre/perf/harness, which orchestrates the benchmarking phases. Two workloads are defined in autosre/perf/workloads, CODING_WORKLOAD and TRANSLATION_WORKLOAD, and the harness executes them concurrently to simulate realistic mixed traffic. For each workload it records Time-To-First-Token (TTFT) and Tokens-Per-Second (TPS) percentiles, alongside internal vLLM scheduler counters. The package entry point, autosre/perf/__init__.py, aggregates the classes and functions needed to run benchmarks and handle results, including RunConfig, RunResult, and PhaseResult.
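To make the idea of concurrent workloads and per-workload percentiles concrete, here is a minimal, self-contained sketch. It does not use the autosre API: Sample, percentile, drive, and run_mixed are hypothetical names, the latency values are canned rather than measured, and the nearest-rank percentile method is an assumption about how percentiles might be computed.

```python
import asyncio
import math
from dataclasses import dataclass


@dataclass
class Sample:
    ttft_s: float  # seconds until the first token arrived
    tps: float     # tokens per second over the whole response


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


async def drive(name, priority, samples):
    """Stand-in for a workload: 'sends' one request per canned sample."""
    collected = []
    for sample in samples:
        await asyncio.sleep(0)  # yield control so the workloads interleave
        collected.append(sample)
    summary = {
        "priority": priority,
        "ttft_p50": percentile([s.ttft_s for s in collected], 50),
        "ttft_p95": percentile([s.ttft_s for s in collected], 95),
        "tps_p50": percentile([s.tps for s in collected], 50),
    }
    return name, summary


async def run_mixed():
    """Run both hypothetical workloads at once and summarize each one."""
    translation = [Sample(0.10, 40.0), Sample(0.20, 38.0),
                   Sample(0.30, 35.0), Sample(0.40, 30.0)]
    coding = [Sample(0.50, 20.0), Sample(0.60, 18.0)]
    results = await asyncio.gather(
        drive("translation", -10, translation),
        drive("coding", 10, coding),
    )
    return dict(results)
```

Calling asyncio.run(run_mixed()) returns one summary dictionary per workload. A real harness would replace the canned samples with timings taken from streaming responses.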

Baseline Management

Baseline management is handled by the autosre/perf/baseline module, which provides functions to load, save, and compare performance data. The Baseline class represents a stored set of performance metrics, while the Violation class indicates when current performance deviates from the expected baseline. The compare function facilitates the comparison between current run results and committed baselines. Additionally, the module supports boot benchmarks through BootBaseline, BootResult, and BootViolation, managed by functions like compare_boot, load_boot_baseline, and save_boot_baseline. This structure allows for both standard performance regression testing and specific boot-time performance tracking.
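The comparison step can be illustrated with a small sketch. This is not the signature or behavior of the compare function in autosre/perf/baseline, which the text above does not specify. MetricViolation, compare_metrics, the metric names, and the 10% relative tolerance are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Assumed direction of "worse" per metric; not taken from autosre.
HIGHER_IS_WORSE = {"ttft_p95_s": True, "tps_p50": False}


@dataclass
class MetricViolation:
    metric: str
    baseline: float
    current: float
    change: float  # signed relative change, e.g. 0.25 means +25%


def compare_metrics(baseline, current, tolerance=0.10):
    """Return a violation for every metric that got worse by more than tolerance."""
    violations = []
    for metric, base in baseline.items():
        # Metrics missing from the current run, or with a zero baseline,
        # are skipped here; a stricter check might flag them instead.
        if metric not in current or base == 0:
            continue
        change = (current[metric] - base) / base
        # Flip the sign for metrics where a lower value is the regression.
        worse = change if HIGHER_IS_WORSE.get(metric, True) else -change
        if worse > tolerance:
            violations.append(
                MetricViolation(metric, base, current[metric], change)
            )
    return violations
```

For example, with a baseline of {"ttft_p95_s": 0.40, "tps_p50": 35.0} and a current run of {"ttft_p95_s": 0.50, "tps_p50": 34.0}, only the TTFT metric is reported, because the throughput drop stays within the assumed tolerance.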

Monitoring and Reporting

The TTFT/TPS percentiles and vLLM scheduler counters captured during a run are rendered for review by the autosre/perf/report module. The render_markdown function produces Markdown output and render_stdout produces plain text for the terminal, so the same results can be read interactively or integrated into CI/CD pipelines.
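As a rough illustration of the two output styles, the sketch below formats the same rows as a Markdown table and as aligned plain text. The functions to_markdown_table and to_plain_text are hypothetical and do not reflect the actual signatures or layout of render_markdown and render_stdout.

```python
def to_markdown_table(rows):
    """Format (workload, metric, value) tuples as a Markdown table."""
    lines = ["| workload | metric | value |", "| --- | --- | --- |"]
    for workload, metric, value in rows:
        lines.append(f"| {workload} | {metric} | {value:.3f} |")
    return "\n".join(lines)


def to_plain_text(rows):
    """Format the same tuples as fixed-width columns for a terminal."""
    lines = []
    for workload, metric, value in rows:
        lines.append(f"{workload:<14}{metric:<14}{value:>8.3f}")
    return "\n".join(lines)
```

Keeping both formatters over one shared row structure means a report can be pasted into a review comment or printed in a terminal without recomputing anything.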

Smoke Testing

In addition to full regression benchmarks, the harness provides a smoke test through the run_smoke function in autosre/perf/smoke. The function returns a SmokeResult object and serves as a quick pre-flight check that the vLLM server is responsive and can handle basic requests before the more comprehensive workload tests run.
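A pre-flight check of this kind might look like the sketch below. It is not the implementation of run_smoke, and ProbeResult is not SmokeResult. The names ProbeResult, http_probe, and smoke_check are hypothetical, and the sketch assumes the server answers a plain HTTP GET with status 200 when healthy. The probe is injectable so the check can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool
    latency_s: float
    detail: str


def http_probe(url, timeout_s=5.0):
    """Send a GET request to url and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.status


def smoke_check(url, probe=http_probe):
    """Probe the server once and report success, latency, and a short detail."""
    start = time.monotonic()
    try:
        status = probe(url)
    except (urllib.error.URLError, OSError) as exc:
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return ProbeResult(False, time.monotonic() - start, str(exc))
    return ProbeResult(status == 200, time.monotonic() - start, f"HTTP {status}")
```

A call such as smoke_check("http://localhost:8010/health") would target the port mentioned earlier; the host name and the /health path are assumptions and should be replaced with whatever the deployment actually exposes.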