Infrastructure & Deployment
The meeting-scribe infrastructure layer manages the lifecycle of services on GB10 hardware, primarily targeting k3s-based deployments. It provides a unified interface for executing commands locally or via SSH, prefetches HuggingFace model weights so that offline pods can start, and verifies service health through HTTP polling. This section details the components responsible for container orchestration support, model caching, and system readiness checks.
Cluster Management and Command Execution
For container management, the LocalRunner exposes methods to interact with Docker, including docker_run, docker_stop, docker_remove, and docker_restart 1. The docker_run method configures containers with specific defaults suitable for ML workloads, such as host networking, GPU access, and shared memory limits. The docker_restart method is specifically used to recover containers where the process is alive but the CUDA context is corrupted, offering a faster recovery than stopping and starting the container 2.
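The defaults described above map onto standard `docker run` flags. The following is a minimal sketch of how such an argument list could be assembled; the function name `build_docker_run_args`, the `8g` shared-memory value, and the parameter names are assumptions for illustration, not the actual `docker_run` implementation.

```python
from __future__ import annotations


def build_docker_run_args(
    image: str,
    name: str,
    *,
    shm_size: str = "8g",  # hypothetical default, not taken from the source
    env: dict[str, str] | None = None,
) -> list[str]:
    """Assemble a `docker run` argv with ML-friendly defaults (illustrative)."""
    args = [
        "docker", "run", "-d",
        "--name", name,
        "--network", "host",     # host networking, so no port mapping is needed
        "--gpus", "all",         # expose all GPUs to the container
        "--shm-size", shm_size,  # enlarge /dev/shm for frameworks that use it
    ]
    # Sort the environment so the resulting command is deterministic.
    for key, value in sorted((env or {}).items()):
        args += ["-e", f"{key}={value}"]
    args.append(image)
    return args
```

Returning a list rather than a shell string matches the `run(cmd: list[str])` surface shown in the LocalRunner excerpt and avoids quoting problems.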
HuggingFace Model Prefetching
The pull_models function in src/meeting_scribe/infra/containers.py is responsible for downloading HuggingFace model weights to the GB10’s shared host cache (/data/huggingface) 3. This pre-flight step ensures that offline k3s pods can find the required models.
For local execution, the function uses the huggingface_hub.snapshot_download API directly. This approach is chosen because the shipped hf CLI is unreliable for scripting due to environment path issues and exit code leaks. The legacy huggingface-cli is also avoided as it acts as a no-op shim that silently performs empty pulls. Any failure during the download raises an exception to prevent partial or empty pulls, which would cause pods to crash-loop with LocalEntryNotFoundError.
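A minimal sketch of the local path, assuming the fail-loudly behaviour described above. The helper name `prefetch_local`, the injectable `download` parameter, and the empty-directory check are illustrative assumptions; only `huggingface_hub.snapshot_download` and its `repo_id`, `cache_dir`, and `token` arguments come from the library itself.

```python
from __future__ import annotations

from pathlib import Path
from typing import Callable


def prefetch_local(
    model_ids: list[str],
    cache_dir: str,
    *,
    token: str | None = None,
    download: Callable[..., str] | None = None,  # hypothetical hook for testing
) -> list[str]:
    """Download each repo into cache_dir and fail on an empty result."""
    if download is None:
        # Imported lazily so the sketch can be exercised without the library.
        from huggingface_hub import snapshot_download

        download = snapshot_download

    snapshot_paths: list[str] = []
    for repo_id in model_ids:
        # Any exception from the download propagates; nothing is swallowed.
        local_path = download(repo_id=repo_id, cache_dir=cache_dir, token=token)
        snapshot = Path(local_path)
        if not snapshot.is_dir() or not any(snapshot.iterdir()):
            raise RuntimeError(f"empty pull for {repo_id!r} at {local_path}")
        snapshot_paths.append(str(snapshot))
    return snapshot_paths
```

Raising on an empty snapshot directory surfaces the problem at pre-flight time rather than as a crash-looping pod later.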
For remote SSH targets, the function executes the hf download command over SSH. This ensures the models are downloaded to the remote GB10 node rather than the local machine. The function manages the HF_TOKEN environment variable to handle gated repositories.
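The remote path can be pictured as building a single shell command that runs on the GB10 node. This is a sketch under stated assumptions: the helper name `build_remote_pull_command` is hypothetical, and pointing the cache via the `HF_HUB_CACHE` environment variable is one plausible way to target `/data/huggingface`; the source does not show how pull_models actually passes the cache directory or the token.

```python
from __future__ import annotations

import shlex


def build_remote_pull_command(
    ssh_target: str,
    repo_id: str,
    cache_dir: str,
    *,
    hf_token: str | None = None,
) -> list[str]:
    """Return an ssh argv that runs `hf download` on the remote host."""
    # Assumption: the cache location is selected with HF_HUB_CACHE.
    env_parts = [f"HF_HUB_CACHE={shlex.quote(cache_dir)}"]
    if hf_token:
        # Only needed for gated repositories; omitted entirely otherwise.
        env_parts.append(f"HF_TOKEN={shlex.quote(hf_token)}")
    remote = " ".join(["env", *env_parts, "hf", "download", shlex.quote(repo_id)])
    return ["ssh", ssh_target, remote]
```

Note that a token placed on the command line is visible in the remote process list, so a real implementation may prefer a different transport for the secret.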
Service Health Checking
The check_service function handles the polling logic with configurable timeouts and retry intervals 4. It returns True if the service responds with a 200 status code. The check_all_services function checks multiple services concurrently using asyncio.gather 5. When wait=True, all services share a total timeout deadline, ensuring the total wait time is determined by the slowest service rather than the sum of all services.
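The shared-deadline behaviour can be illustrated with a small, self-contained sketch. The names `wait_until_healthy` and `check_all` and the injectable `probe` callables are assumptions for illustration; the real functions poll HTTP endpoints with httpx, which is replaced here by plain coroutines so the example runs without a network.

```python
from __future__ import annotations

import asyncio
import time
from typing import Awaitable, Callable

Probe = Callable[[], Awaitable[bool]]


async def wait_until_healthy(probe: Probe, deadline: float, interval: float = 0.05) -> bool:
    """Poll one probe until it succeeds or the shared deadline passes."""
    while True:
        if await probe():
            return True
        # Give up if the next attempt would start after the deadline.
        if time.monotonic() + interval > deadline:
            return False
        await asyncio.sleep(interval)


async def check_all(probes: dict[str, Probe], total_timeout: float) -> dict[str, bool]:
    """Poll all probes concurrently against a single absolute deadline."""
    deadline = time.monotonic() + total_timeout
    names = list(probes)
    results = await asyncio.gather(
        *(wait_until_healthy(probes[name], deadline) for name in names)
    )
    return dict(zip(names, results))
```

Because every task compares against the same absolute deadline, the overall wait is bounded by `total_timeout` no matter how many services are checked.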
If a service is healthy, the checker also attempts to retrieve the loaded model ID from the /v1/models endpoint 4. This information is included in the ServiceStatus dataclass.
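As a rough sketch of the model lookup, the helper below extracts the first model ID from a `/v1/models` response. It assumes the endpoint returns the OpenAI-compatible list shape (`{"data": [{"id": ...}]}`); the function name `extract_model_id` is hypothetical and the source does not show how the checker parses the response.

```python
from __future__ import annotations

from typing import Any


def extract_model_id(payload: Any) -> str | None:
    """Return the first model ID from a /v1/models payload, or None."""
    if not isinstance(payload, dict):
        return None
    data = payload.get("data")
    if not isinstance(data, list) or not data:
        return None
    first = data[0]
    if not isinstance(first, dict):
        return None
    model_id = first.get("id")
    # Anything other than a string is treated as "unknown".
    return model_id if isinstance(model_id, str) else None
```

Returning None on any unexpected shape keeps a malformed model listing from turning an otherwise healthy service into a failed check, which fits the optional `model` field on ServiceStatus.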
Privileged Helper Daemon
The provided sources do not contain information about a privileged helper daemon for root-level operations. The LocalRunner and SSHRunner abstractions handle command execution, but the specific implementation of a privileged daemon is not described in the referenced files.
"""Local command runner - mirrors SSHRunner but executes via subprocess.
Used when meeting-scribe runs on the GB10 itself (the common dev setup).
"""

from __future__ import annotations

import logging
import subprocess

logger = logging.getLogger(__name__)


class LocalRunner:
    """Execute commands on the local host with the same surface as SSHRunner."""

    def __init__(self) -> None:
        self.node = None

    @property
    def ssh_target(self) -> str:
        return "local"

    def run(
        self,
        cmd: list[str],
        timeout: int = 30,
        check: bool = True,
    ) -> subprocess.CompletedProcess[str]:
        logger.debug("LOCAL: %s", " ".join(cmd))
        return subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            check=check,
        )

    def run_bg(self, cmd: list[str]) -> str:
        proc = subprocess.Popen(

    def docker_container_exists(self, name: str) -> tuple[bool, bool]:
        """Return (exists, running). Used by start_container() to pick
        between `docker start` (existing) and `docker run` (fresh)."""
        result = self.run(
            ["docker", "inspect", "--format", "{{.State.Running}}", name],
            timeout=10,
            check=False,
        )
        if result.returncode != 0:
            return (False, False)
        running = result.stdout.strip().lower() == "true"
        return (True, running)

    def docker_start(self, container_id: str) -> bool:
        result = self.run(
            ["docker", "start", container_id],
            timeout=30,
            check=False,
        )
        return result.returncode == 0

    def docker_remove(self, container_id: str, force: bool = True) -> bool:
        args = ["docker", "rm"]
        if force:
            args.append("-f")
        args.append(container_id)
        result = self.run(args, timeout=30, check=False)
        return result.returncode == 0

    def docker_restart(self, container_id: str, timeout: int = 30) -> bool:
        """Restart a running container in place (single docker restart).

        Used to recover a container whose process is alive but whose
        CUDA context has been corrupted (e.g. pyannote after concurrent
        calls wedged the GPU). Much faster than docker stop + up for
        a single container.
        """
        result = self.run(
            ["docker", "restart", "-t", str(timeout), container_id],
"""HuggingFace model pre-fetch for the GB10 model stack (k3s-only).
The backends run as k3s pods (`helm/meeting-scribe`); this module keeps the one
pre-flight helper that isn't k3s's job - pulling HuggingFace model weights into
the shared host cache (`/data/huggingface`) so the offline pods find them. The
container-lifecycle helpers were removed with docker-compose (k3s owns the pods).
"""

from __future__ import annotations

import logging
import os
from pathlib import Path
from typing import TYPE_CHECKING

from meeting_scribe import paths

if TYPE_CHECKING:
    from meeting_scribe.infra.local import LocalRunner
    from meeting_scribe.infra.ssh import SSHRunner

    # Either runner works - both expose `run`, `rsync`, etc.
    Runner = LocalRunner | SSHRunner
else:
    Runner = object

logger = logging.getLogger(__name__)


def pull_models(
    ssh: Runner,
    model_ids: list[str],
    hf_cache_dir: str = str(paths.DEFAULT_HF_CACHE_DIR),
    *,
    hf_token: str | None = None,
) -> None:
    """Download models to the GB10's HuggingFace cache.

    For a LOCAL target the download runs in-process via
    ``huggingface_hub.snapshot_download`` - NOT the ``hf`` CLI. The shipped
"""Service health checking with retry and timeout.
Polls HTTP health endpoints until they respond or timeout is reached.
Pattern adapted from auto-sre's _wait_for_vllm().
"""

from __future__ import annotations

import asyncio
import logging
from dataclasses import dataclass

import httpx

logger = logging.getLogger(__name__)

# Default service ports (host networking, no port mapping)
SERVICE_PORTS = {
    "translation": 8010,
    "diarization": 8001,
    "tts": 8002,
    "asr": 8003,
}


@dataclass
class ServiceStatus:
    """Health status of a single service."""

    name: str
    url: str
    healthy: bool
    model: str | None = None
    error: str | None = None


async def check_service(
    url: str,
    *,
    timeout: float = 5.0,
    shared ``total_timeout`` deadline. Total wait = max(slowest_service)
    instead of sum(all_services).

    Args:
        host: GB10 IP or hostname.
        ports: Override default service ports.
        wait: If True, wait for each service to become healthy.
        total_timeout: Max wait time (shared across all services when
            wait=True).

    Returns:
        Dict mapping service name to ServiceStatus.
    """
    svc_ports = ports or SERVICE_PORTS
    tasks = [
        _check_one_service(name, port, host, wait=wait, total_timeout=total_timeout)
        for name, port in svc_ports.items()
    ]
    pairs = await asyncio.gather(*tasks)
    return {name: status for name, status in pairs}