Infrastructure & Provisioning

The autosre repository manages the lifecycle of GB10 nodes in three layers: node provisioning, SSH-based transport, and optional K3s cluster orchestration. vLLM serving runs directly over SSH and Docker; the K3s cluster is an optional enterprise management overlay and is not in the serving path. State is persisted through local backups, and models and Docker images are synchronized across nodes with rsync over the ConnectX-7 link, so that during maintenance a TP=2 deployment can degrade to a solo model on the surviving node instead of going offline.

Cluster Lifecycle Management

The cluster lifecycle is managed by the ClusterManager class, which orchestrates a 2-node K3s cluster on GB10 nodes. This management layer is optional and distinct from the primary vLLM serving path, which uses SSH and Docker directly ¹. The K3s install flags select the Docker runtime (--docker) rather than containerd and disable Traefik and ServiceLB; K3s is pinned to v1.31.4+k3s1, the GPU Operator to v25.10.0, and the Network Operator to v25.4.0 ².
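As an illustration of how the pinned version and the driver.enabled=false setting could translate into a Helm invocation, here is a minimal sketch. The helper name gpu_operator_helm_args, the release name, the namespace, and the nvidia/gpu-operator chart reference are assumptions for illustration; the excerpts below do not show how deploy_gpu_operator builds its command.

```python
import shlex

# Version taken from autosre/cluster/manager.py.
GPU_OPERATOR_VERSION = "v25.10.0"


def gpu_operator_helm_args(version: str = GPU_OPERATOR_VERSION) -> list[str]:
    """Hypothetical helper: argv that installs the GPU Operator without its driver.

    Release name, namespace, and chart reference are assumed, not taken from
    the repository. The chart reference presumes an "nvidia" Helm repo alias.
    """
    return [
        "helm", "upgrade", "--install", "gpu-operator", "nvidia/gpu-operator",
        "--namespace", "gpu-operator", "--create-namespace",
        "--version", version,
        # DGX OS already ships the NVIDIA driver, so the operator must not install one.
        "--set", "driver.enabled=false",
    ]


if __name__ == "__main__":
    # shlex.join produces a single string that is safe to pass to a remote shell.
    print(shlex.join(gpu_operator_helm_args()))
```

Returning an argv list rather than a preformatted string keeps quoting in one place: the caller decides whether to run it locally or join it for remote execution.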

The bootstrap process runs five steps in order: install the K3s server on the head node, retrieve the join token, join the agent nodes, deploy the NVIDIA GPU Operator, and deploy the NVIDIA Network Operator. A failed operator deployment is reported as a non-fatal warning and bootstrap still completes. The GPU Operator is configured with driver.enabled=false because DGX OS ships with a pre-installed driver, and the Network Operator enables RDMA over ConnectX-7. Teardown runs /usr/local/bin/k3s-agent-uninstall.sh on each worker node and then /usr/local/bin/k3s-uninstall.sh on the head node ³.
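The ordering and the fatal versus non-fatal distinction can be sketched as a small step runner. This is not the repository's bootstrap method: the Step dataclass and run_bootstrap function are hypothetical, and treating the first three steps as fatal is an assumption, since the excerpt below only shows that operator failures are non-fatal.

```python
from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class Step:
    """One bootstrap step (hypothetical type, for illustration only)."""

    label: str
    action: Callable[[], bool]  # returns True on success
    fatal: bool = True          # assumption: non-operator steps abort on failure


def run_bootstrap(steps: list[Step]) -> bool:
    """Run steps in order; stop at the first fatal failure, warn on non-fatal ones."""
    total = len(steps)
    for index, step in enumerate(steps, start=1):
        print(f"[{index}/{total}] {step.label}...")
        if step.action():
            continue
        if step.fatal:
            print(f"{step.label} failed, aborting bootstrap")
            return False
        print(f"{step.label} failed (non-fatal)")
    return True


if __name__ == "__main__":
    demo = [
        Step("Installing k3s server", lambda: True),
        Step("Retrieving join token", lambda: True),
        Step("Joining agent nodes", lambda: True),
        Step("Deploying NVIDIA GPU Operator", lambda: False, fatal=False),
        Step("Deploying NVIDIA Network Operator", lambda: True, fatal=False),
    ]
    print("bootstrap ok:", run_bootstrap(demo))
```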

Node Provisioning

Node provisioning handles the transition from a vanilla DGX OS to a production-ready state, supporting wipe-and-rebuild scenarios and rolling rebuilds ⁴. The Provisioner class manages the technical steps of backup, image saving, and restoration, while the NodeLifecycle class orchestrates the rolling rebuild strategy.

During a rolling rebuild, the system ensures service availability by degrading to a solo model on the surviving node if a TP=2 model is running ⁵. The process involves syncing models and Docker images to the surviving node, performing a pre-wipe backup, and then waiting for the user to physically wipe the target node. After the node is reimaged and SSH access is restored, the system performs a post-wipe restore, loads saved Docker images, and validates the node ⁶.
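The step where the rebuild waits for SSH access to return is essentially a poll with a deadline. The sketch below shows one way to write it; wait_until and its parameters are hypothetical names, and the connectivity check is passed in as a callable because the excerpts do not show the SSHRunner method used for this.

```python
import time
from collections.abc import Callable


def wait_until(
    check: Callable[[], bool],
    *,
    timeout_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses.

    check() always runs at least once, so a node that is already reachable
    returns immediately. clock and sleep are injectable for testing.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    attempts = iter([False, False, True])  # stand-in for an SSH connectivity probe
    print(wait_until(lambda: next(attempts), timeout_s=30, interval_s=0.01))
```

Using a monotonic clock avoids a spurious timeout if the operator machine's wall clock is adjusted while the node is being reimaged.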

A full wipe of all nodes is also supported, which preserves data on /data/ partitions but requires a complete re-bootstrap of the K3s cluster and re-download of all models.

SSH Transport

All infrastructure operations, including cluster management and provisioning, are executed via SSH ². The SSHRunner class abstracts the remote execution layer, handling connectivity checks and command execution on GB10 nodes.

For cluster operations, the ClusterManager initializes an SSHRunner for the head node and creates new runners for worker nodes as needed. Commands such as kubectl, helm, and K3s installation scripts are executed remotely through this transport. The K3s installation script itself is designed to be robust against flaky links, using retry logic and checksum verification for the binary download.
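Running kubectl or helm through SSH means the remote argument list has to survive one extra round of shell parsing. The following sketch shows that quoting step with the standard library; ssh_argv is a hypothetical helper, the user and host values are placeholders, and none of this is claimed to match the actual SSHRunner implementation.

```python
import shlex


def ssh_argv(user: str, host: str, remote_argv: list[str], connect_timeout_s: int = 10) -> list[str]:
    """Hypothetical helper: local argv that runs remote_argv on host via ssh.

    The remote command is joined with shlex.join so that arguments containing
    spaces or shell metacharacters reach the remote program unchanged.
    """
    return [
        "ssh",
        "-o", "BatchMode=yes",                        # fail instead of prompting for a password
        "-o", f"ConnectTimeout={connect_timeout_s}",
        f"{user}@{host}",
        shlex.join(remote_argv),
    ]


if __name__ == "__main__":
    cmd = ssh_argv("admin", "gb10-head", ["kubectl", "get", "nodes", "-o", "wide", "--no-headers"])
    print(cmd)
    # The resulting list can be handed to subprocess.run(cmd, capture_output=True, text=True).
```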

Configuration Persistence

Configuration and state persistence are handled through local backups and cross-node synchronization. The Provisioner class manages pre-wipe backups and Docker image saving to ensure state can be restored after a wipe ⁶.

Model data is persisted in /data/huggingface/ and synchronized between nodes using rsync over the ConnectX-7 network, which provides high-speed transfer capabilities (~185 Gbps). This synchronization is critical during rolling rebuilds to ensure the surviving node has the necessary model data to continue serving. Docker images are saved to /data/docker-images/ and transferred between nodes using the same mechanism.
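A directory-to-directory sync of this kind depends on rsync's trailing-slash rule: a source path ending in / copies the directory's contents rather than the directory itself. The sketch below builds such a command; rsync_argv is a hypothetical helper, the user name is a placeholder, and the flag choice (-a, --partial) is an assumption, because the rsync_to_node call is truncated in the excerpt below.

```python
def rsync_argv(src_dir: str, dest_user: str, dest_ip: str, dest_dir: str = "") -> list[str]:
    """Hypothetical helper: rsync argv mirroring src_dir onto another node.

    Meant to run ON the source node, so src_dir resolves to that node's tree.
    dest_dir defaults to the same path as src_dir.
    """
    # Normalize to exactly one trailing slash so contents are copied, not nested.
    src = src_dir.rstrip("/") + "/"
    dest = (dest_dir or src_dir).rstrip("/") + "/"
    return [
        "rsync",
        "-a",          # archive mode: recursive, preserves permissions and times
        "--partial",   # keep partially transferred files so large models can resume
        src,
        f"{dest_user}@{dest_ip}:{dest}",
    ]


if __name__ == "__main__":
    print(rsync_argv("/data/huggingface", "admin", "192.0.2.10"))
    print(rsync_argv("/data/docker-images/", "admin", "192.0.2.10"))
```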

Cluster status is aggregated into a ClusterStatus dataclass: overall readiness, whether the K3s server is running, per-node NodeStatus entries, GPU Operator and Network Operator readiness, NCCL health, total GPU memory, and an optional error message ⁷.
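The status method below calls kubectl get nodes -o wide --no-headers, but its parsing loop is cut off in the excerpt. As a hedged sketch of what turning one output line into a NodeStatus could look like, the code below relies on the standard column order of that command (NAME, STATUS, ROLES, AGE, VERSION, INTERNAL-IP, ...). The parse_node_line function is hypothetical, and the dataclass is a trimmed local copy so the example is self-contained.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class NodeStatus:
    """Trimmed local copy of the dataclass in autosre/cluster/status.py."""

    hostname: str
    ip: str
    ready: bool
    roles: list[str] = field(default_factory=list)
    k3s_version: Optional[str] = None


def parse_node_line(line: str) -> NodeStatus:
    """Hypothetical parser for one line of `kubectl get nodes -o wide --no-headers`."""
    fields = line.split()
    if len(fields) < 6:
        raise ValueError(f"unexpected kubectl output: {line!r}")
    # Later columns (OS image etc.) may contain spaces, so only the first six are used.
    name, status, roles, _age, version, internal_ip = fields[:6]
    return NodeStatus(
        hostname=name,
        ip=internal_ip,
        # STATUS can be compound, e.g. "Ready,SchedulingDisabled"; "NotReady" must not match.
        ready="Ready" in status.split(","),
        roles=[] if roles == "<none>" else roles.split(","),
        k3s_version=version,
    )


if __name__ == "__main__":
    sample = "gb10-a  Ready  control-plane,master  5d  v1.31.4+k3s1  10.0.0.1  <none>  Ubuntu 24.04 LTS"
    print(parse_node_line(sample))
```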

autosre/cluster/__init__.py L1-15

"""K3s cluster management for GB10 nodes.

Optional overlay for enterprise management demos.
NOT in the vLLM serving path (that uses SSH+Docker directly).
"""

from autosre.cluster.manager import ClusterManager
from autosre.cluster.status import ClusterStatus, NodeStatus

__all__ = [
    "ClusterManager",
    "ClusterStatus",
    "NodeStatus",
]

autosre/cluster/manager.py L1-120 (showing 40 of 120)

"""K3s cluster lifecycle manager for GB10 nodes.

Manages a 2-node k3s cluster with NVIDIA GPU and Network operators.
This is an OPTIONAL management overlay - vLLM serving uses SSH+Docker directly.

K3s specifics for GB10:
- Docker runtime (--docker), NOT containerd
- GPU Operator v25.10+ (driver.enabled=false, DGX OS has pre-installed driver)
- Network Operator for RDMA over ConnectX-7
- --disable=traefik (no ingress needed)
"""

from __future__ import annotations

import shlex
from typing import TYPE_CHECKING

import click

from autosre.cluster.status import ClusterStatus, NodeStatus
from autosre.infra.ssh import SSHRunner

if TYPE_CHECKING:
    import subprocess

    from autosre.backends.vllm_config import VllmConfig
    from autosre.infra.types import GB10Node

# K3s install configuration
K3S_VERSION = "v1.31.4+k3s1"
K3S_INSTALL_FLAGS = "--docker --write-kubeconfig-mode 644 --disable=traefik --disable=servicelb"

# NVIDIA operator versions
GPU_OPERATOR_VERSION = "v25.10.0"
NETWORK_OPERATOR_VERSION = "v25.4.0"


def _k3s_install_script(*, exec_flags: str, k3s_url: str = "", k3s_token: str = "") -> str:
    """Build a remote bash that stages + sha256-verifies the k3s binary, then runs
    the installer with INSTALL_K3S_SKIP_DOWNLOAD=true.

autosre/cluster/manager.py L121-240 (showing 40 of 120)

        click.echo("\n[4/5] Deploying NVIDIA GPU Operator...")
        if not self.deploy_gpu_operator():
            click.secho("GPU Operator deployment failed (non-fatal)", fg="yellow")

        # Step 5: Network Operator
        click.echo("\n[5/5] Deploying NVIDIA Network Operator...")
        if not self.deploy_network_operator():
            click.secho("Network Operator deployment failed (non-fatal)", fg="yellow")

        click.secho("\nCluster bootstrap complete!", fg="green", bold=True)
        return True

    def teardown(self) -> bool:
        """Remove k3s from all nodes."""
        click.echo("Tearing down k3s cluster...")

        for node in self.config.worker_nodes:
            runner = SSHRunner(node)
            click.echo(f"  Removing k3s agent on {node.hostname}...")
            runner.run(["/usr/local/bin/k3s-agent-uninstall.sh"], check=False, timeout=60)

        click.echo(f"  Removing k3s server on {self.config.head_node.hostname}...")
        self._head_ssh.run(["/usr/local/bin/k3s-uninstall.sh"], check=False, timeout=60)

        click.secho("Cluster torn down.", fg="green")
        return True

    def status(self) -> ClusterStatus:
        """Get cluster health and component status."""
        try:
            result = self._kubectl("get", "nodes", "-o", "wide", "--no-headers")
            if result.returncode != 0:
                return ClusterStatus(
                    cluster_ready=False,
                    k3s_server_running=False,
                    error="k3s server not running or kubectl failed",
                )

            nodes = []
            for line in result.stdout.strip().splitlines():

autosre/provision/__init__.py L1-16

"""GB10 node provisioning and lifecycle management.

Handles:
- Day-0 provisioning: vanilla DGX OS to production-ready
- Wipe & rebuild: repeatable clean-slate + restore
- Rolling rebuild: one node at a time with service degradation
"""

from autosre.provision.lifecycle import NodeLifecycle
from autosre.provision.provisioner import Provisioner

__all__ = [
    "NodeLifecycle",
    "Provisioner",
]

autosre/provision/lifecycle.py L1-120 (showing 40 of 120)

"""Rolling rebuild and lifecycle management for GB10 node clusters.

Handles wipe/rebuild while maintaining service availability
by degrading to a solo model during maintenance windows.
"""

from __future__ import annotations

from typing import TYPE_CHECKING

import click

from autosre.infra.ssh import SSHRunner
from autosre.infra.types import SOLO_FALLBACK_MODEL, NodeRole
from autosre.provision.provisioner import Provisioner

if TYPE_CHECKING:
    from autosre.infra.types import GB10Node


class NodeLifecycle:
    """Manages wipe/rebuild across a multi-node GB10 cluster.

    Key constraint: TP=2 models (nemotron-super, qwen3.6-122b, qwen3.6-397b)
    require both nodes. During rebuild of either node, service degrades to
    a solo model on the surviving node.
    """

    def __init__(self, nodes: list[GB10Node]) -> None:
        self.nodes = nodes

    @property
    def head_node(self) -> GB10Node:
        for node in self.nodes:
            if node.role is NodeRole.HEAD:
                return node
        return self.nodes[0]

    @property
    def worker_nodes(self) -> list[GB10Node]:

autosre/provision/lifecycle.py L121-207 (showing 40 of 87)


            # Step 6: Post-wipe restore + re-provision
            click.echo("  Restoring state and re-provisioning...")
            provisioner = Provisioner(node)
            if not provisioner.post_wipe_restore():
                click.secho(f"  Restore failed for {node.hostname}", fg="red")
                return False

            # Step 7: Load saved Docker images
            click.echo("  Loading saved Docker images...")
            provisioner.load_docker_images()

            # Step 8: Validate
            click.echo("  Validating...")
            ok, issues = provisioner.validate()
            if not ok:
                click.secho(f"  Validation failed: {issues}", fg="red")
                return False

            click.secho(f"  {node.hostname} rebuilt successfully!", fg="green")

        click.echo(f"\n{'═' * 50}")
        click.secho("All nodes rebuilt successfully!", fg="green", bold=True)
        click.echo(
            "Cluster models available again. Start with: autosre start -b vllm -m nemotron-super"
        )
        return True

    def sync_models(self, source: GB10Node, dest: GB10Node) -> bool:
        """Rsync HuggingFace models between nodes.

        Uses ConnectX-7 for high-speed transfer (~185 Gbps).
        """
        runner = SSHRunner(source)

        click.echo(f"  Syncing models: {source.ip} -> {dest.ip}")
        # rsync must run ON the source node so /data/huggingface/ resolves to the
        # source node's tree, not the operator machine's.
        result = runner.rsync_to_node(
            "/data/huggingface/",

autosre/cluster/status.py L1-32

"""Cluster status dataclasses."""

from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NodeStatus:
    """Status of a single k3s node."""

    hostname: str
    ip: str
    ready: bool
    roles: list[str] = field(default_factory=list)
    k3s_version: str | None = None
    gpu_detected: bool = False


@dataclass
class ClusterStatus:
    """Overall k3s cluster status."""

    cluster_ready: bool
    k3s_server_running: bool
    nodes: list[NodeStatus] = field(default_factory=list)
    gpu_operator_ready: bool = False
    network_operator_ready: bool = False
    nccl_healthy: bool = False
    total_gpu_memory_gb: int = 0
    error: str | None = None