
External AI Services

The meetingscribe project uses a standalone FastAPI service for speaker diarization, implemented in containers/pyannote/server.py. The service wraps the pyannote/speaker-diarization-community-1 pipeline, which was promoted to production on 2026-04-28 after achieving 100% time-weighted overlap resolution on a 40-minute, 4-speaker meeting benchmark. It is containerized and runs on NVIDIA GB10 (aarch64, Blackwell) hardware, which requires SM_121 compatibility patches to bypass PyTorch architecture checks. For stability on shared GPU resources, the service holds a process-level asynchronous lock that serializes inference requests, preventing CUDA “unknown error” wedges caused by Sortformer’s non-thread-safe internal streams.

The diarization service is designed to run on hardware where GPU contention is a risk, such as when a 35B vLLM model is also active. To prevent race conditions and CUDA errors, the service uses an asyncio.Lock (_pipeline_lock) to ensure only one diarization call holds the GPU at any time. This lock serializes HTTP requests at the application layer, ensuring the model never receives overlapping calls even if multiple clients send requests simultaneously.

Inside the lock, the synchronous pipeline execution is offloaded to a thread pool using fastapi.concurrency.run_in_threadpool. This keeps the Uvicorn event loop responsive while the heavy computation occurs. After inference, the service explicitly deletes the input waveform reference and calls torch.cuda.empty_cache() to prevent cached activations from accumulating across calls, which was identified as the primary cause of performance degradation (“wedges”) after many requests.
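The locking pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the service's code: fake_pipeline is a hypothetical stand-in for the blocking pipeline call, asyncio.to_thread is swapped in for fastapi.concurrency.run_in_threadpool, and the lock is passed as an argument rather than held in a module-level _pipeline_lock.

```python
import asyncio
import threading
import time

# Tracks how many pipeline calls run at once, so the demo can show serialization.
stats = {"active": 0, "max_active": 0}
_stats_guard = threading.Lock()


def fake_pipeline(waveform):
    """Hypothetical stand-in for the blocking diarization pipeline call."""
    with _stats_guard:
        stats["active"] += 1
        stats["max_active"] = max(stats["max_active"], stats["active"])
    time.sleep(0.01)  # pretend to do GPU work
    with _stats_guard:
        stats["active"] -= 1
    return {"num_samples": len(waveform)}


async def diarize(pipeline_lock, waveform):
    # Only one coroutine at a time gets past this point.
    async with pipeline_lock:
        try:
            # Run the blocking call in a worker thread so the event loop stays free.
            # (Stand-in for fastapi.concurrency.run_in_threadpool.)
            return await asyncio.to_thread(fake_pipeline, waveform)
        finally:
            # Drop the input reference; the real service would also call
            # torch.cuda.empty_cache() at this point.
            del waveform


async def run_concurrent(n_requests):
    pipeline_lock = asyncio.Lock()  # plays the role of _pipeline_lock
    calls = (diarize(pipeline_lock, [0.0] * 160) for _ in range(n_requests))
    return await asyncio.gather(*calls)
```

Running asyncio.run(run_concurrent(5)) issues five overlapping requests, yet stats["max_active"] stays at 1 because each call waits for the lock before entering the worker thread.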

The service runs on NVIDIA Blackwell GPUs (SM_121), which are binary-compatible with Hopper (SM_90) but trigger failures in some PyTorch code paths that check the architecture version. The startup routine applies two patches to mitigate this:

  1. Python-level Spoofing: The torch.cuda.get_device_capability function is patched to report SM_90 (Hopper) when the real capability is SM_121 (Blackwell). This bypasses Python-level checks but does not cover C++/CUDA kernels.
  2. CUDA Kernel Fallback: The torch.nn.functional.one_hot kernel is patched to force execution on the CPU for tiny tensors (shape (T, N) where N≤10 speakers). This prevents torch.AcceleratorError: CUDA error: unknown error during powerset.to_multilabel operations under shared-GPU contention. While this adds microsecond overhead, it eliminates the wedge entirely.
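The two patches can be sketched as plain wrapper factories. This is an assumption-laden illustration: spoof_capability and make_cpu_one_hot are hypothetical names, the size test (a class count of at most 10) is a guess at how "tiny" might be decided, and the wiring shown in the trailing comments is only one plausible way to install the wrappers at startup.

```python
def spoof_capability(real_get_capability, real=(12, 1), reported=(9, 0)):
    """Wrap a get_device_capability-style function so SM_121 is reported as SM_90."""
    def patched(device=None):
        capability = tuple(real_get_capability(device))
        # Only rewrite the Blackwell value; other devices pass through untouched.
        return reported if capability == real else capability
    return patched


def make_cpu_one_hot(original_one_hot, max_classes=10):
    """Wrap a one_hot-style function so small CUDA inputs are computed on the CPU."""
    def patched(tensor, num_classes=-1):
        on_gpu = getattr(tensor, "is_cuda", False)
        if on_gpu and 0 < num_classes <= max_classes:
            # Compute on the CPU, then move the result back to the original device.
            return original_one_hot(tensor.cpu(), num_classes).to(tensor.device)
        return original_one_hot(tensor, num_classes)
    return patched


# Possible wiring at startup (assumes torch is importable in the container):
#   import torch
#   torch.cuda.get_device_capability = spoof_capability(torch.cuda.get_device_capability)
#   torch.nn.functional.one_hot = make_cpu_one_hot(torch.nn.functional.one_hot)
```

Keeping the wrappers free of a direct torch import makes them easy to exercise with fakes, which matters because the failure they guard against only appears under shared-GPU contention.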

If the CUDA state becomes corrupted due to these contention issues, the recommended mitigation is to restart the container using docker compose up -d --force-recreate pyannote-diarize.

The service exposes three endpoints: /health, /v1/diarize, and /v1/embed.

The /health endpoint returns the operational status of the pipeline and GPU.

  • Status 200: {"status": "ok", "diarization_model": true, "embedding_model": true, "gpu_memory_allocated_mb": ..., "gpu_memory_reserved_mb": ...}.
  • Status 503: Returned if the pipeline is not loaded or if a CUDA error is detected during the health check.
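A caller might interpret these responses along the following lines. This is a client-side sketch; is_ready is a hypothetical helper, and treating a false model flag as "not ready" is an assumption beyond what the status codes alone guarantee.

```python
def is_ready(status_code, payload):
    """Return True only when /health reports 200 and both models are loaded."""
    if status_code != 200:
        # 503 covers both "pipeline not loaded" and "CUDA error during the check".
        return False
    return (
        payload.get("status") == "ok"
        and payload.get("diarization_model") is True
        and payload.get("embedding_model") is True
    )
```

A supervisor could poll with this check and fall back to recreating the container when it keeps returning False.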

The /v1/diarize endpoint accepts raw s16le PCM or WAV audio in the request body. Configuration is passed via headers:

  • X-Sample-Rate: Default 16000.
  • X-Max-Speakers: Default 4.
  • X-Min-Speakers: Default 2 (auto-disabled for audio < 15s to prevent validation errors).
  • X-Num-Speakers: Exact count override (default none).
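The header defaults and the short-audio rule can be illustrated with a small round trip. Both functions are hypothetical: diarize_headers shows what a client might send, and resolve_speaker_bounds is a guess at how a server could turn those headers into speaker-count arguments. In particular, letting X-Num-Speakers replace the min/max bounds entirely is an assumption.

```python
MIN_SPEAKERS_CUTOFF_S = 15.0  # below this duration, the minimum is dropped


def diarize_headers(sample_rate=16000, max_speakers=4, min_speakers=2, num_speakers=None):
    """Build the request headers for /v1/diarize using the documented defaults."""
    headers = {
        "X-Sample-Rate": str(sample_rate),
        "X-Max-Speakers": str(max_speakers),
        "X-Min-Speakers": str(min_speakers),
    }
    if num_speakers is not None:
        headers["X-Num-Speakers"] = str(num_speakers)
    return headers


def resolve_speaker_bounds(headers, duration_s):
    """Translate headers into speaker-count arguments (assumed server-side logic)."""
    exact = headers.get("X-Num-Speakers")
    if exact is not None:
        # Assumption: an exact count overrides both bounds.
        return {"num_speakers": int(exact)}
    bounds = {"max_speakers": int(headers.get("X-Max-Speakers", 4))}
    if duration_s >= MIN_SPEAKERS_CUTOFF_S:
        bounds["min_speakers"] = int(headers.get("X-Min-Speakers", 2))
    return bounds


# One second of s16le silence at 16 kHz, usable as a request body.
silence_pcm = b"\x00\x00" * 16000
```

With the defaults, a 10-second clip resolves to only a maximum of 4 speakers, while a 60-second clip also carries the minimum of 2.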

Response Shape:

{
  "segments": [
    {
      "speaker_id": int,
      "start": float,
      "end": float,
      "confidence": float,
      "embedding": [float]?
    }
  ],
  "exclusive_segments": [
    {
      "speaker_id": int,
      "start": float,
      "end": float,
      "confidence": float
    }
  ],
  "num_speakers": int,
  "audio_duration_s": float,
  "processing_ms": int
}

The segments array allows detection of cross-talk (multiple speakers in the same window), while exclusive_segments provides a single-speaker timeline used for STT-timestamp reconciliation. Speaker embeddings (256-d WeSpeaker) are included if available.
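As an illustration of how the segments array exposes cross-talk, the sketch below scans for time ranges where segments from different speakers overlap. find_cross_talk is a hypothetical helper written against the response shape above, not part of the service.

```python
def find_cross_talk(segments):
    """Return (start, end, speaker_a, speaker_b) for each overlapping pair of
    segments that belong to different speakers."""
    ordered = sorted(segments, key=lambda s: s["start"])
    overlaps = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second["start"] >= first["end"]:
                # Sorted by start, so no later segment can overlap `first`.
                break
            if first["speaker_id"] != second["speaker_id"]:
                overlaps.append((
                    second["start"],
                    min(first["end"], second["end"]),
                    first["speaker_id"],
                    second["speaker_id"],
                ))
    return overlaps


example = [
    {"speaker_id": 0, "start": 0.0, "end": 5.0, "confidence": 0.9},
    {"speaker_id": 1, "start": 4.0, "end": 6.0, "confidence": 0.8},
    {"speaker_id": 0, "start": 7.0, "end": 8.0, "confidence": 0.9},
]
print(find_cross_talk(example))  # [(4.0, 5.0, 0, 1)]
```

Running the same function over exclusive_segments should yield an empty list, since that timeline assigns each moment to a single speaker.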

The /v1/embed endpoint extracts a speaker embedding from a short audio clip (2-10s recommended).

  • Response: {"embedding": [float...]}.
  • Status 422: Returned if no speaker is detected or if the embedding contains NaN values.
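A client can mirror the 422 conditions defensively before storing an embedding. usable_embedding is a hypothetical helper; it treats any non-200 status, an empty vector, or a NaN entry as "no embedding".

```python
import math


def usable_embedding(status_code, payload):
    """Return the embedding list, or None when the response should be discarded."""
    if status_code != 200:
        # 422 means no speaker was detected or the embedding contained NaN.
        return None
    embedding = payload.get("embedding") or []
    if not embedding or any(math.isnan(value) for value in embedding):
        return None
    return embedding
```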