Infrastructure & Deployment

The meetingscribe infrastructure layer manages the lifecycle of services on GB10 hardware, primarily targeting k3s-based deployments. It provides a unified interface for executing commands locally or via SSH, prefetching HuggingFace model weights to ensure offline pod availability, and verifying service health through HTTP polling. This section details the components responsible for container orchestration support, model caching, and system readiness checks.

For container management, the LocalRunner exposes methods that wrap the Docker CLI: docker_run, docker_stop, docker_remove, and docker_restart [1]. The docker_run method starts containers with defaults suited to ML workloads, such as host networking, GPU access, and a shared memory limit. The docker_restart method is used to recover containers whose process is still alive but whose CUDA context is corrupted; it recovers faster than stopping and then starting the container [2].
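As a rough illustration of the defaults described above, the sketch below assembles a docker run command line with host networking, GPU access, and a shared memory size. The helper names build_docker_run_args and build_docker_restart_args, the 16g value, and the environment handling are assumptions made for illustration; they are not taken from the LocalRunner source. Only standard Docker CLI flags are used.

```python
from typing import Dict, List, Optional


def build_docker_run_args(
    name: str,
    image: str,
    env: Optional[Dict[str, str]] = None,
    shm_size: str = "16g",  # assumed value; the real default is not stated
) -> List[str]:
    """Build a `docker run` argv with ML-oriented defaults (hypothetical helper)."""
    args = [
        "docker", "run", "-d",
        "--name", name,
        "--network", "host",     # share the host network stack
        "--gpus", "all",         # expose all GPUs to the container
        "--shm-size", shm_size,  # enlarge /dev/shm for ML frameworks
    ]
    # Sort the environment so the generated command is deterministic.
    for key, value in sorted((env or {}).items()):
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return args


def build_docker_restart_args(name: str) -> List[str]:
    """`docker restart` stops and starts the same container in one step."""
    return ["docker", "restart", name]
```

Returning an argument list rather than a shell string lets the same command be passed to subprocess.run locally or quoted and sent over SSH.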

The pull_models function in src/meeting_scribe/infra/containers.py is responsible for downloading HuggingFace model weights to the GB10’s shared host cache (/data/huggingface) [3]. This pre-flight step ensures that offline k3s pods can find the required models.

For local execution, the function uses the huggingface_hub.snapshot_download API directly. This approach is chosen because the shipped hf CLI is unreliable for scripting due to environment path issues and exit code leaks. The legacy huggingface-cli is also avoided as it acts as a no-op shim that silently performs empty pulls. Any failure during the download raises an exception to prevent partial or empty pulls, which would cause pods to crash-loop with LocalEntryNotFoundError.
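A minimal sketch of the local path, assuming the cache location is passed through the cache_dir argument of huggingface_hub.snapshot_download. The function name pull_models_local, the injectable download parameter (added here so the logic can be exercised without network access), and the emptiness check are illustrative assumptions, not the project's code.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def pull_models_local(
    repo_ids: Iterable[str],
    cache_dir: str = "/data/huggingface",
    token: Optional[str] = None,
    download: Optional[Callable[..., str]] = None,  # hypothetical test hook
) -> List[str]:
    """Download each repo snapshot and fail loudly on any error or empty result."""
    if download is None:
        # Imported lazily so the sketch can be read and tested without the package.
        from huggingface_hub import snapshot_download
        download = snapshot_download

    paths: List[str] = []
    for repo_id in repo_ids:
        # Exceptions from the download are deliberately not caught:
        # a failed pull should stop the deployment before pods start.
        path = download(repo_id=repo_id, cache_dir=cache_dir, token=token)
        snapshot = Path(path)
        if not snapshot.is_dir() or not any(snapshot.iterdir()):
            raise RuntimeError(f"empty snapshot for {repo_id!r} at {path!r}")
        paths.append(str(snapshot))
    return paths
```

Letting exceptions propagate mirrors the behavior described above: a partial or empty pull is treated as a hard failure rather than something the pods discover later.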

For remote SSH targets, the function executes the hf download command over SSH. This ensures the models are downloaded to the remote GB10 node rather than the local machine. The function manages the HF_TOKEN environment variable to handle gated repositories.
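The remote path can be sketched as building an ssh command that runs hf download on the target node. The helper name build_remote_pull_command and the use of the HF_HUB_CACHE environment variable to select the cache directory are assumptions; the source states only that hf download is run over SSH and that HF_TOKEN is managed for gated repositories.

```python
import shlex
from typing import List, Optional


def build_remote_pull_command(
    host: str,
    repo_id: str,
    cache_dir: str = "/data/huggingface",
    token: Optional[str] = None,
) -> List[str]:
    """Build an ssh argv that runs `hf download` on the remote node (hypothetical helper)."""
    # Assumption: the cache location is selected through HF_HUB_CACHE.
    env = {"HF_HUB_CACHE": cache_dir}
    if token:
        # Only set HF_TOKEN when a token is available (needed for gated repos).
        env["HF_TOKEN"] = token
    # Quote every value so the remote shell sees each one as a single word.
    assignments = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    remote_command = f"{assignments} hf download {shlex.quote(repo_id)}"
    return ["ssh", host, remote_command]
```

Placing the token on the remote command line keeps the sketch short, but it can be visible in process listings on the node; a real implementation may prefer another way of passing it.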

The check_service function polls a service over HTTP with a configurable timeout and retry interval [4]. It returns True if the service responds with a 200 status code. The check_all_services function checks multiple services concurrently using asyncio.gather [5]. When wait=True, all services share a single timeout deadline, so the total wait is bounded by the slowest service rather than by the sum of the per-service waits.
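The shared-deadline behavior can be sketched as follows. To keep the example self-contained, each service is represented by a hypothetical async probe callable that returns an HTTP status code; the signatures, the default interval, and the error handling are assumptions and differ from the real check_service and check_all_services.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict

# Hypothetical stand-in for an HTTP GET: returns the response status code.
Probe = Callable[[], Awaitable[int]]


async def check_service(probe: Probe, deadline: float, interval: float = 0.5) -> bool:
    """Poll one service until it returns 200 or the shared deadline passes."""
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        try:
            # Never let a single probe run past the shared deadline.
            if await asyncio.wait_for(probe(), timeout=remaining) == 200:
                return True
        except (OSError, asyncio.TimeoutError):
            pass  # connection refused or timed out: retry until the deadline
        pause = min(interval, max(0.0, deadline - time.monotonic()))
        await asyncio.sleep(pause)


async def check_all_services(
    probes: Dict[str, Probe], timeout: float, interval: float = 0.5
) -> Dict[str, bool]:
    """Check all services concurrently against one deadline."""
    deadline = time.monotonic() + timeout  # computed once, shared by all
    names = list(probes)
    results = await asyncio.gather(
        *(check_service(probes[name], deadline, interval) for name in names)
    )
    return dict(zip(names, results))
```

Because the deadline is computed once and passed to every poller, three services that each need up to the full timeout still finish within roughly one timeout in total.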

If a service is healthy, the checker also attempts to retrieve the loaded model ID from the /v1/models endpoint [4]. This information is included in the ServiceStatus dataclass.
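A sketch of how the model ID might be extracted, assuming the endpoint returns the OpenAI-compatible list shape ({"object": "list", "data": [{"id": ...}]}). The ServiceStatus fields shown here and the extract_model_id helper are illustrative guesses; the source confirms only that ServiceStatus is a dataclass that carries the model information.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class ServiceStatus:
    # Field names are assumptions made for this sketch.
    name: str
    healthy: bool
    model_id: Optional[str] = None


def extract_model_id(payload: Any) -> Optional[str]:
    """Return the first model ID from a /v1/models response, or None."""
    data = payload.get("data") if isinstance(payload, Mapping) else None
    if isinstance(data, list) and data and isinstance(data[0], Mapping):
        model_id = data[0].get("id")
        if isinstance(model_id, str):
            return model_id
    # A missing or malformed payload should not turn a healthy service unhealthy.
    return None
```

Returning None on an unexpected payload keeps the model lookup best-effort, in line with the description that the checker only attempts to retrieve the ID.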

The referenced source files do not describe a privileged helper daemon for root-level operations. The LocalRunner and SSHRunner abstractions handle command execution, but no implementation of such a daemon appears in these files.