Infrastructure & Provisioning

The autosre repository manages the lifecycle of GB10 nodes in three layers: low-level node provisioning, SSH-based transport, and optional K3s cluster orchestration. The vLLM serving path runs directly over SSH and Docker, while an optional enterprise management overlay bootstraps a K3s cluster to provide a unified control plane. Infrastructure state is persisted in local backups and synchronized between nodes over high-speed RDMA-capable links, which limits service degradation during maintenance such as node rebuilds.

The cluster lifecycle is managed by the ClusterManager class, which orchestrates a 2-node K3s cluster on GB10 nodes. This management layer is optional and separate from the primary vLLM serving path, which uses SSH and Docker directly 1. The cluster configuration selects the Docker runtime (--docker), disables the bundled Traefik and ServiceLB components, and targets specific NVIDIA operator versions for GPU and network management 2.
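
As a rough illustration of the configuration described above, the following sketch builds a K3s server install command with those flags. The function and constant names are hypothetical and are not taken from the repository. The sketch assumes the standard get.k3s.io installer, which reads server arguments from the INSTALL_K3S_EXEC environment variable. Operator versions are omitted because this page does not state them.

```python
import shlex

# Flags taken from the description above. How ClusterManager actually passes
# them is not shown on this page, so this layout is an assumption.
K3S_SERVER_FLAGS = ["--docker", "--disable", "traefik", "--disable", "servicelb"]


def k3s_server_install_command(flags=K3S_SERVER_FLAGS):
    """Return a shell command that installs a K3s server with the given flags."""
    exec_args = "server " + " ".join(flags)
    # shlex.quote keeps the whole argument string as one environment value.
    return (
        "curl -sfL https://get.k3s.io | "
        f"INSTALL_K3S_EXEC={shlex.quote(exec_args)} sh -"
    )


if __name__ == "__main__":
    print(k3s_server_install_command())
```

The command is returned as a string rather than executed, so it can be handed to whatever remote transport runs it on the head node.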

The bootstrap process follows a strict sequence: install the K3s server on the head node, retrieve the join token, join the agent nodes, and deploy the NVIDIA GPU and Network Operators. The GPU Operator is configured with driver.enabled=false because DGX OS ships with pre-installed drivers, while the Network Operator is deployed to enable RDMA over ConnectX-7. Teardown removes K3s components from all nodes by running the K3s uninstall scripts 3.
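
The sequence above can be sketched as an ordered plan of (host, command) steps. This is a minimal sketch, not the repository's implementation: the Step type, the function names, the namespaces, the "<join-token>" placeholder, and the teardown order are all assumptions. The token path and uninstall script locations are the K3s installer defaults, and the Helm commands assume the NVIDIA chart repository has already been added under the name nvidia.

```python
from dataclasses import dataclass
from typing import List

INSTALLER = "curl -sfL https://get.k3s.io"
NODE_TOKEN_PATH = "/var/lib/rancher/k3s/server/node-token"  # K3s default
JOIN_TOKEN_PLACEHOLDER = "<join-token>"  # filled in from the read-token step


@dataclass(frozen=True)
class Step:
    name: str     # short label for logging
    host: str     # node on which the command runs
    command: str  # shell command to execute remotely


def bootstrap_plan(head: str, workers: List[str]) -> List[Step]:
    """Return the bootstrap steps in the order described in the text."""
    steps = [
        Step("install-server", head,
             f"{INSTALLER} | INSTALL_K3S_EXEC='server --docker "
             "--disable traefik --disable servicelb' sh -"),
        Step("read-token", head, f"sudo cat {NODE_TOKEN_PATH}"),
    ]
    for worker in workers:
        steps.append(Step(
            "join-agent", worker,
            f"{INSTALLER} | K3S_URL=https://{head}:6443 "
            f"K3S_TOKEN={JOIN_TOKEN_PLACEHOLDER} sh -"))
    steps.append(Step(
        "gpu-operator", head,
        "helm install gpu-operator nvidia/gpu-operator "
        "--namespace gpu-operator --create-namespace "
        "--set driver.enabled=false"))  # DGX OS already provides the driver
    steps.append(Step(
        "network-operator", head,
        "helm install network-operator nvidia/network-operator "
        "--namespace nvidia-network-operator --create-namespace"))
    return steps


def teardown_plan(head: str, workers: List[str]) -> List[Step]:
    """Uninstall agents first, then the server (the order is an assumption)."""
    steps = [Step("uninstall-agent", w, "/usr/local/bin/k3s-agent-uninstall.sh")
             for w in workers]
    steps.append(Step("uninstall-server", head, "/usr/local/bin/k3s-uninstall.sh"))
    return steps
```

Expressing the sequence as data keeps the ordering explicit and makes it possible to log or dry-run the plan before anything is executed on a node.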

Node provisioning handles the transition from a vanilla DGX OS to a production-ready state, supporting wipe-and-rebuild scenarios and rolling rebuilds 4. The Provisioner class manages the technical steps of backup, image saving, and restoration, while the NodeLifecycle class orchestrates the rolling rebuild strategy.

During a rolling rebuild, the system ensures service availability by degrading to a solo model on the surviving node if a TP=2 model is running 5. The process involves syncing models and Docker images to the surviving node, performing a pre-wipe backup, and then waiting for the user to physically wipe the target node. After the node is reimaged and SSH access is restored, the system performs a post-wipe restore, loads saved Docker images, and validates the node 6.
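
To make the ordering concrete, here is a small sketch that lists the rolling-rebuild steps for one target node. The function name and step wording are hypothetical, and placing the degrade step first is an assumption. The actual NodeLifecycle logic is not shown on this page.

```python
from typing import List


def rolling_rebuild_steps(target: str, survivor: str, tensor_parallel: int) -> List[str]:
    """Return human-readable rebuild steps for `target`, in execution order."""
    steps = []
    if tensor_parallel == 2:
        # A TP=2 model spans both nodes, so it cannot keep running while one
        # of them is wiped. Fall back to a solo model on the survivor.
        steps.append(f"degrade to a solo model on {survivor}")
    steps += [
        f"sync models and Docker images to {survivor}",
        f"run pre-wipe backup of {target}",
        f"wait for the operator to wipe and reimage {target}",
        f"wait for SSH access to {target}",
        f"run post-wipe restore on {target}",
        f"load saved Docker images on {target}",
        f"validate {target}",
    ]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(rolling_rebuild_steps("gb10-b", "gb10-a", 2), 1):
        print(f"{number}. {step}")
```

Only the manual wipe requires a human. Every other step is something the tooling can drive over SSH once the node is reachable again.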

A full wipe of all nodes is also supported. It preserves data on /data/ partitions but requires a complete re-bootstrap of the K3s cluster and re-download of all models.

All infrastructure operations, including cluster management and provisioning, are executed via SSH 2. The SSHRunner class abstracts the remote execution layer, handling connectivity checks and command execution on GB10 nodes.
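
A minimal sketch of such a transport wrapper is shown below, assuming the system ssh client and key-based authentication. The class name SSHRunnerSketch, its methods, and its defaults are hypothetical. The real SSHRunner interface is not documented on this page.

```python
import subprocess
from typing import List, Optional


class SSHRunnerSketch:
    """Runs shell commands on one remote host via the local ssh client."""

    def __init__(self, host: str, user: Optional[str] = None, connect_timeout: int = 5):
        self.host = host
        self.user = user
        self.connect_timeout = connect_timeout

    def argv(self, command: str) -> List[str]:
        """Build the ssh argument vector for a remote command."""
        target = f"{self.user}@{self.host}" if self.user else self.host
        return [
            "ssh",
            "-o", "BatchMode=yes",  # fail instead of prompting for a password
            "-o", f"ConnectTimeout={self.connect_timeout}",
            target,
            command,
        ]

    def run(self, command: str, timeout: float = 60.0) -> subprocess.CompletedProcess:
        """Execute the command remotely and capture stdout and stderr as text."""
        return subprocess.run(
            self.argv(command), capture_output=True, text=True, timeout=timeout
        )

    def is_reachable(self) -> bool:
        """Connectivity check: can a trivial command run on the host?"""
        try:
            return self.run("true", timeout=self.connect_timeout + 5).returncode == 0
        except (subprocess.TimeoutExpired, OSError):
            return False
```

Keeping argument construction in a separate method makes the wrapper easy to test without a live node, since the test can inspect the argument vector instead of opening a connection.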

For cluster operations, the ClusterManager initializes an SSHRunner for the head node and creates new runners for worker nodes as needed. Commands such as kubectl, helm, and the K3s installation script are executed remotely through this transport. The K3s installation script is designed to tolerate flaky links: it retries the binary download and verifies the downloaded binary's checksum.
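
The retry-plus-checksum pattern mentioned above can be illustrated in Python as follows. This sketches the general technique only, not the installation script itself. The function names, the exponential backoff, and the injectable fetcher and sleep parameters are assumptions made for the example.

```python
import hashlib
import time
import urllib.request


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download the full response body for a URL."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_verified(url, expected_sha256, attempts=3, delay=1.0,
                      fetcher=fetch, sleep=time.sleep):
    """Download `url`, retrying until the SHA-256 digest matches.

    Both network errors and checksum mismatches count as failed attempts.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            data = fetcher(url)
            digest = hashlib.sha256(data).hexdigest()
            if digest == expected_sha256:
                return data
            last_error = ValueError(f"checksum mismatch: got {digest}")
        except OSError as exc:  # urllib's URLError is a subclass of OSError
            last_error = exc
        if attempt < attempts:
            sleep(delay * 2 ** (attempt - 1))  # exponential backoff
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error
```

Treating a checksum mismatch as a retryable failure covers the case where a flaky link delivers a truncated or corrupted file without raising a network error.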

Configuration and state persistence are handled through local backups and cross-node synchronization. The Provisioner class manages pre-wipe backups and Docker image saving to ensure state can be restored after a wipe 6.

Model data is persisted in /data/huggingface/ and synchronized between nodes using rsync over the ConnectX-7 network, which provides high-speed transfer capabilities (~185 Gbps). This synchronization is critical during rolling rebuilds to ensure the surviving node has the necessary model data to continue serving. Docker images are saved to /data/docker-images/ and transferred between nodes using the same mechanism.
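
As an illustration of the synchronization step, the sketch below builds an rsync argument vector for copying one directory to a peer node. The function name is hypothetical, and the exact rsync flags the repository uses are not shown on this page. The sketch assumes that dest_host is the peer's address on the ConnectX-7 interface, so that traffic takes the fast link.

```python
from typing import List


def rsync_command(src_dir: str, dest_host: str, dest_dir: str) -> List[str]:
    """Build an rsync argv that mirrors src_dir into dest_dir on dest_host."""
    # Trailing slashes make rsync copy the directory contents rather than
    # nesting src_dir inside dest_dir.
    source = src_dir.rstrip("/") + "/"
    destination = f"{dest_host}:{dest_dir.rstrip('/')}/"
    return [
        "rsync",
        "-a",         # archive mode: recurse, keep permissions and timestamps
        "--partial",  # keep partial files so large model shards can resume
        source,
        destination,
    ]


if __name__ == "__main__":
    # 192.0.2.2 is a documentation-only placeholder address.
    for path in ("/data/huggingface", "/data/docker-images"):
        print(" ".join(rsync_command(path, "192.0.2.2", path)))
```

The same helper covers both model data and saved Docker images, since the text describes them as using the same transfer mechanism.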

Cluster status information, including node readiness, GPU operator status, and network operator status, is aggregated into a ClusterStatus dataclass, which provides a comprehensive view of the cluster’s health 7.
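
A status container of this kind might look like the following sketch. The field names and the healthy property are assumptions chosen for illustration. The actual ClusterStatus fields are not listed on this page.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ClusterStatusSketch:
    """Aggregated health view of the cluster (hypothetical field names)."""

    nodes_ready: Dict[str, bool] = field(default_factory=dict)  # node name -> Ready
    gpu_operator_ready: bool = False
    network_operator_ready: bool = False

    @property
    def healthy(self) -> bool:
        """True only if at least one node is known and everything is ready."""
        return (
            bool(self.nodes_ready)
            and all(self.nodes_ready.values())
            and self.gpu_operator_ready
            and self.network_operator_ready
        )


if __name__ == "__main__":
    status = ClusterStatusSketch(
        nodes_ready={"gb10-a": True, "gb10-b": False},
        gpu_operator_ready=True,
        network_operator_ready=True,
    )
    print(status.healthy)
```

An empty node map is treated as unhealthy so that a failed status query is not mistaken for a healthy cluster.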