Core Inference Pipeline
The Core Inference Pipeline transforms raw audio streams into structured, multilingual transcripts and synthesized speech through a series of swappable, asynchronous processing stages. Abstract Base Classes (ABCs) decouple the server logic from specific model implementations, so backends such as vLLM-hosted Qwen models or pyannote.audio diarization can be replaced without changing the server. The pipeline tracks state through internal buffers and monotonic sample offsets as audio chunks flow from ingestion through ASR, diarization, and translation to TTS. All interactions between stages are mediated by async iterators and queue-based event dispatching.
Backend Abstractions and Initialization
All inference components are defined as abstract base classes in src/meeting_scribe/backends/base.py, giving the server a unified interface for managing lifecycle and data flow without knowing the concrete implementation [1]. The hierarchy includes ASRBackend for speech-to-text, DiarizeBackend for speaker diarization, TranslateBackend for language conversion, and TTSBackend for speech synthesis. Each backend implements start() to initialize its model (for example, loading weights or warming up the engine) and stop() to release resources.
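As a rough illustration of the lifecycle contract described above, the sketch below pairs start() with stop() so that resources are released even when the work in between fails. The names LifecycleBackend, DummyBackend, and run_with_backend are hypothetical and do not come from base.py; only the start()/stop() pairing is taken from the text.

```python
import asyncio
from abc import ABC, abstractmethod


class LifecycleBackend(ABC):
    """Hypothetical stand-in for the start()/stop() part of a backend ABC."""

    @abstractmethod
    async def start(self) -> None:
        """Load weights, warm up the engine, and so on."""

    @abstractmethod
    async def stop(self) -> None:
        """Release whatever start() acquired."""


class DummyBackend(LifecycleBackend):
    """Toy implementation that only records lifecycle calls."""

    def __init__(self) -> None:
        self.calls: list[str] = []

    async def start(self) -> None:
        self.calls.append("start")

    async def stop(self) -> None:
        self.calls.append("stop")


async def run_with_backend(backend: LifecycleBackend) -> None:
    """Run start(), some work, then stop(), even if the work raises."""
    await backend.start()
    try:
        await asyncio.sleep(0)  # placeholder for real inference work
    finally:
        await backend.stop()
```

Because the caller only depends on the abstract type, any concrete backend that honors the same two methods could be passed to run_with_backend unchanged.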
The ASRBackend declares shared state attributes such as _buffer, _buffer_samples, and _segment_id on the ABC itself, so the server can access them without narrowing the type to a specific subclass. Concrete backends populate these during initialization, while the ABC provides default sentinels to satisfy static type checkers. Similarly, DiarizeBackend defines abstract methods for processing audio and enrolling speakers, while TranslateBackend and TTSBackend define the interfaces for translation and synthesis respectively [2].
ASR and Diarization Processing
The ASRBackend receives resampled 16 kHz mono float32 audio chunks and yields TranscriptEvent objects containing segment IDs, revision tracking, and finality flags. The primary entry point for audio data is process_audio_bytes(), a byte-oriented convenience wrapper that decodes little-endian int16 PCM bytes into float32, normalizes them, and drains the process_audio() iterator into the event-dispatch pipeline. Concrete backends may override this wrapper for zero-copy paths, but the default implementation ensures compatibility.
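The decoding step described above can be sketched with NumPy as follows. The function name pcm16le_to_float32 and the divisor 32768 are assumptions made for illustration; the source only states that little-endian int16 PCM is decoded to float32 and normalized.

```python
import numpy as np


def pcm16le_to_float32(pcm: bytes) -> np.ndarray:
    """Decode little-endian int16 PCM bytes to float32 in [-1.0, 1.0).

    Hypothetical helper: the scale factor 32768 is an assumed convention.
    """
    if len(pcm) % 2:
        raise ValueError("int16 PCM needs an even number of bytes")
    # "<i2" pins the byte order to little-endian regardless of the host CPU.
    samples = np.frombuffer(pcm, dtype="<i2")
    return samples.astype(np.float32) / 32768.0
```

Pinning the dtype to "<i2" rather than np.int16 keeps the decode correct on big-endian hosts, at no cost on little-endian ones.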
The process_audio() method is the core abstract interface. It accepts a float32 numpy array, a monotonic sample offset, and the sample rate, and returns an AsyncIterator[TranscriptEvent]. A flush() method is also provided to emit final events for any buffered audio. The backend supports runtime language switching via set_languages(), which updates the ASR prompt; concrete implementations like VllmASRBackend override this to invalidate system-prompt caches, while the default is a no-op.
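To make the buffering and flush behavior concrete, here is a minimal sketch of a toy backend that accumulates chunks and emits one event per full window. ToyEvent, ToyASR, the window parameter, and the transcribe helper are all hypothetical; in particular, treating flush() as an async iterator is an assumption, since the source does not show its signature.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import dataclass

import numpy as np


@dataclass
class ToyEvent:
    """Hypothetical stand-in for TranscriptEvent."""
    segment_id: str
    start_sample: int
    num_samples: int
    is_final: bool


class ToyASR:
    """Buffers audio and emits one final event per full window."""

    def __init__(self, window: int = 16000) -> None:
        self._window = window
        self._buffer: list[np.ndarray] = []
        self._buffer_samples = 0
        self._base_offset = 0
        self._count = 0

    def _emit(self) -> ToyEvent:
        event = ToyEvent(f"seg-{self._count}", self._base_offset,
                         self._buffer_samples, True)
        self._count += 1
        self._buffer, self._buffer_samples = [], 0
        return event

    async def process_audio(
        self, audio: np.ndarray, sample_offset: int, sample_rate: int = 16000
    ) -> AsyncIterator[ToyEvent]:
        if not self._buffer:
            self._base_offset = sample_offset  # first chunk of a new segment
        self._buffer.append(audio)
        self._buffer_samples += len(audio)
        if self._buffer_samples >= self._window:
            yield self._emit()

    async def flush(self) -> AsyncIterator[ToyEvent]:
        if self._buffer_samples:
            yield self._emit()  # emit whatever is still buffered


async def transcribe(chunks: list[np.ndarray]) -> list[ToyEvent]:
    """Feed chunks with a monotonic sample offset, then flush."""
    asr, events, offset = ToyASR(), [], 0
    for chunk in chunks:
        async for event in asr.process_audio(chunk, offset):
            events.append(event)
        offset += len(chunk)
    async for event in asr.flush():
        events.append(event)
    return events
```

Feeding three 8000-sample chunks yields one event for the first full window and a second, shorter event from flush(), with start offsets that never decrease.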
DiarizeBackend assigns cluster_id values to audio segments. This is distinct from speaker identification, which maps clusters to enrolled identities. Audio is processed via process_audio(), which returns a list of SpeakerAttribution objects; the list may contain multiple entries for overlapping speech. Speakers are enrolled using enroll_speaker(), which takes a display name and reference audio (3-10 seconds recommended) and returns an enrollment ID string.
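The difference between a cluster and an enrolled identity can be illustrated with a small registry. Everything here is hypothetical: ToyAttribution stands in for SpeakerAttribution (whose real fields are not shown in the source), and ToyEnrollment, bind, and resolve are invented names for the mapping step.

```python
import uuid
from dataclasses import dataclass


@dataclass
class ToyAttribution:
    """Hypothetical stand-in for SpeakerAttribution."""
    cluster_id: int
    start_sample: int
    end_sample: int


class ToyEnrollment:
    """Maps anonymous diarization clusters to enrolled display names."""

    def __init__(self) -> None:
        self._names: dict[str, str] = {}     # enrollment ID -> display name
        self._clusters: dict[int, str] = {}  # cluster_id -> enrollment ID

    def enroll_speaker(self, name: str) -> str:
        enrollment_id = uuid.uuid4().hex
        self._names[enrollment_id] = name
        return enrollment_id

    def bind(self, cluster_id: int, enrollment_id: str) -> None:
        self._clusters[cluster_id] = enrollment_id

    def resolve(self, attribution: ToyAttribution) -> str:
        """Return the enrolled name, or a generic label for unknown clusters."""
        enrollment_id = self._clusters.get(attribution.cluster_id)
        if enrollment_id is None:
            return f"Speaker {attribution.cluster_id}"
        return self._names[enrollment_id]
```

With this split, diarization can keep emitting stable cluster IDs for speakers who were never enrolled, and identities can be attached later without reprocessing audio.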
Translation and TTS Stages
The TranslateBackend handles text translation between Japanese and English. Its translate() method accepts source text, language codes, and optional prior context. The prior_context argument passes a rolling window of earlier (source_text, translation) tuples to the system prompt, which helps the model anchor on running topics and avoid hallucinating full sentences from fragmented utterances. The meeting_id argument tags JSONL rows so the validation harness can attribute them. The method raises TimeoutError if translation exceeds the configured timeout.
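One way the prior-context window and the timeout could fit together is sketched below. The prompt wording, build_system_prompt, and translate_with_timeout are assumptions for illustration only; the source does not show how the real backend formats its prompt or enforces the timeout.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Optional


def build_system_prompt(
    source_lang: str,
    target_lang: str,
    prior_context: Optional[list[tuple[str, str]]] = None,
) -> str:
    """Fold earlier (source_text, translation) pairs into a system prompt."""
    lines = [f"Translate from {source_lang} to {target_lang}."]
    if prior_context:
        lines.append("Earlier turns, for context only:")
        for source_text, translation in prior_context:
            lines.append(f"- {source_text} => {translation}")
    return "\n".join(lines)


async def translate_with_timeout(
    call: Callable[[], Awaitable[str]], timeout_s: float
) -> str:
    """Await a translation call, raising the built-in TimeoutError if it is slow."""
    try:
        return await asyncio.wait_for(call(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Before Python 3.11, asyncio.TimeoutError is not the built-in one.
        raise TimeoutError(f"translation exceeded {timeout_s} s") from None
```

Keeping the context in the system prompt, rather than concatenating it with the text to translate, leaves the source text itself untouched and easy to log.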
The TTSBackend synthesizes speech from text and supports voice cloning via optional reference audio. The synthesize() method takes text, a target language, and an optional voice_reference numpy array (3 seconds recommended) used to clone a speaker’s voice. It returns float32 audio samples at the specified sample rate.
Live Transcript Event Flow and Queue Management
The pipeline moves data through asynchronous iterators and internal buffers, keeping sample offsets monotonic across stages. Audio enters via process_audio_bytes(), which converts PCM to float32 and passes it to process_audio(). The ASRBackend accumulates chunks in its internal buffers (_buffer, _buffer_samples) before yielding TranscriptEvent objects, which are then dispatched through the backend’s internal pipeline.
Diarization runs in parallel with ASR or as a subsequent stage: DiarizeBackend.process_audio() returns SpeakerAttribution lists that are merged with ASR results. Translation and TTS operate on the resulting text, with translation using context windows to maintain coherence. Because the entire flow is driven by async iterators, audio chunks are processed and events are yielded without blocking.
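The queue-based hand-off between stages can be sketched with asyncio.Queue. This is a simplified two-stage model, not the project's dispatcher: asr_stage, translate_stage, run_pipeline, the None sentinel, and the upper-casing "translation" are all placeholders chosen to keep the example self-contained.

```python
import asyncio


async def asr_stage(chunks: list[tuple[int, str]], out_q: asyncio.Queue) -> None:
    """Pretend ASR: emit one (sample_offset, text) event per chunk."""
    offset = 0
    for num_samples, text in chunks:
        await out_q.put((offset, text))
        offset += num_samples  # offsets only ever increase
    await out_q.put(None)  # sentinel: no more events


async def translate_stage(in_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    """Pretend translation: upper-case each text, preserving its offset."""
    while (item := await in_q.get()) is not None:
        offset, text = item
        await out_q.put((offset, text.upper()))
    await out_q.put(None)


async def run_pipeline(chunks: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Wire the stages together and collect the final events in order."""
    asr_q: asyncio.Queue = asyncio.Queue(maxsize=8)  # bounded for backpressure
    out_q: asyncio.Queue = asyncio.Queue(maxsize=8)
    producer = asyncio.create_task(asr_stage(chunks, asr_q))
    worker = asyncio.create_task(translate_stage(asr_q, out_q))
    results = []
    while (item := await out_q.get()) is not None:
        results.append(item)
    await asyncio.gather(producer, worker)
    return results
```

Bounded queues mean a slow downstream stage makes the upstream put() wait instead of letting events pile up in memory, which is one common reason to prefer queues over direct calls between stages.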
"""Abstract base classes for inference backends.

All inference is behind swappable ABCs for GB10 production.

Backend hierarchy:
    ASRBackend       - Qwen3-ASR-1.7B via vLLM
    DiarizeBackend   - pyannote.audio speaker diarization
    TranslateBackend - vLLM+Qwen3.6
    TTSBackend       - Qwen3-TTS via vLLM
"""

from __future__ import annotations

from abc import ABC, abstractmethod
from collections.abc import AsyncIterator
from typing import TYPE_CHECKING

import numpy as np

if TYPE_CHECKING:
    from meeting_scribe.models import SpeakerAttribution, TranscriptEvent


class ASRBackend(ABC):
    """Abstract speech-to-text backend.

    Receives resampled 16kHz mono float32 audio chunks.
    Yields TranscriptEvents with segment_id, revision tracking,
    and is_final flag.
    """

    # ── Shared state attributes (declared on ABC so server.py can
    # access them without narrowing to a concrete subclass). Concrete
    # backends are expected to populate these in __init__; the ABC
    # provides default sentinels so mypy sees the attributes exist on
    # the union of all backend types.
    _buffer: list[np.ndarray]
    _buffer_samples: int
    _segment_id: str | None
    _base_offset: int

    # ...


class DiarizeBackend(ABC):
    # ...

    @abstractmethod
    async def process_audio(
        self,
        audio: np.ndarray,
        sample_offset: int,
        sample_rate: int = 16000,
    ) -> list[SpeakerAttribution]:
        """Assign speakers to an audio chunk.

        Returns a list of SpeakerAttributions (may be >1 for overlapping speech).
        """

    @abstractmethod
    async def enroll_speaker(
        self,
        name: str,
        audio: np.ndarray,
        sample_rate: int = 16000,
    ) -> str:
        """Enroll a speaker from a reference audio clip.

        Args:
            name: Display name (e.g., "Tanaka").
            audio: Reference audio (3-10 seconds recommended).
            sample_rate: Audio sample rate.

        Returns:
            Enrollment ID string.
        """


class TranslateBackend(ABC):
    """Abstract translation backend.

    Translates text between Japanese and English.
    """

    @abstractmethod
    async def start(self) -> None:
        """Initialize translation model."""