Core Inference Pipeline

The Core Inference Pipeline turns raw audio streams into structured, multilingual transcripts and synthesized speech through a series of swappable, asynchronous processing stages. Abstract Base Classes (ABCs) decouple the server logic from specific model implementations, so backends such as vLLM-hosted Qwen models or pyannote.audio for diarization can be exchanged without changing the server. The pipeline tracks state through internal buffers and monotonically increasing sample offsets as audio chunks flow from ingestion through ASR, diarization, translation, and finally TTS. All interactions between stages are mediated by async iterators and queue-based event dispatching.

All inference components are defined as abstract base classes in src/meeting_scribe/backends/base.py, giving the server a unified interface for managing lifecycle and data flow without knowing the concrete implementation details [1]. The hierarchy includes ASRBackend for speech-to-text, DiarizeBackend for speaker identification, TranslateBackend for language conversion, and TTSBackend for speech synthesis. Each backend implements a start() method for model initialization, such as loading weights or warming up the engine, and a stop() method for resource release.
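
The lifecycle contract described above can be sketched as follows. This is a minimal illustration, not the code in base.py: the shared BackendSketch base, the DummyBackend class, its loaded flag, and the run_with_lifecycle helper are all hypothetical, and the sketch assumes start() and stop() are coroutines, which the text does not state.

```python
import asyncio
from abc import ABC, abstractmethod


class BackendSketch(ABC):
    """Hypothetical common base; the source describes four separate ABCs."""

    @abstractmethod
    async def start(self) -> None:
        """Load weights or warm up the engine."""

    @abstractmethod
    async def stop(self) -> None:
        """Release model resources."""


class DummyBackend(BackendSketch):
    """Stand-in for a concrete backend; 'loaded' is an illustrative flag."""

    def __init__(self) -> None:
        self.loaded = False

    async def start(self) -> None:
        self.loaded = True

    async def stop(self) -> None:
        self.loaded = False


async def run_with_lifecycle(backends: list) -> list:
    """Start every backend, report readiness, then stop in reverse order."""
    started = []
    try:
        for backend in backends:
            await backend.start()
            started.append(backend)
        return [b.loaded for b in started]
    finally:
        # Stop only what was actually started, newest first.
        for backend in reversed(started):
            await backend.stop()
```

Because the server-side helper only calls start() and stop(), any class that satisfies the abstract interface can be substituted without touching the helper.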

The ASRBackend declares shared state attributes such as _buffer, _buffer_samples, and _segment_id on the ABC itself, so the server can read them without narrowing the type to a specific subclass. Concrete backends populate these attributes during initialization, while the ABC provides default sentinels to satisfy static type checkers. Similarly, DiarizeBackend defines abstract methods for processing audio and enrolling speakers, while TranslateBackend and TTSBackend define the interfaces for translation and synthesis respectively [2].
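
A minimal sketch of the sentinel pattern, assuming class-level defaults of None and 0. The attribute types, the ToyASR subclass, and the buffered_seconds helper are assumptions made for illustration; only the three attribute names come from the text.

```python
from abc import ABC
from typing import Any, Optional


class ASRBackendSketch(ABC):
    # Class-level sentinels: every subclass inherits them, so a type checker
    # accepts attribute access on the base type. The types are assumptions.
    _buffer: Optional[Any] = None
    _buffer_samples: int = 0
    _segment_id: Optional[str] = None


class ToyASR(ASRBackendSketch):
    """Hypothetical concrete backend that fills in the shared state."""

    def __init__(self) -> None:
        self._buffer = []
        self._buffer_samples = 0
        self._segment_id = "seg-0"


def buffered_seconds(backend: ASRBackendSketch, sample_rate: int = 16000) -> float:
    """Server-side helper: reads shared state without an isinstance check."""
    return backend._buffer_samples / sample_rate
```

The helper is annotated against the base class only, which is the point of declaring the attributes there.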

The ASRBackend receives resampled 16 kHz mono float32 audio chunks and yields TranscriptEvent objects containing segment IDs, revision tracking, and finality flags. The primary entry point for audio data is process_audio_bytes(), a byte-oriented convenience wrapper that decodes little-endian int16 PCM bytes into float32, normalizes them, and drains the internal process_audio() method into the event-dispatch pipeline. Concrete backends may override this wrapper for zero-copy paths, but the default implementation ensures compatibility.
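
The decoding step can be sketched with NumPy as below. The function name pcm16le_to_float32 is hypothetical, and the divisor 32768 is an assumption: the text says only that samples are normalized, and dividing by 32768 is one common convention that maps int16 into [-1.0, 1.0).

```python
import numpy as np


def pcm16le_to_float32(data: bytes) -> np.ndarray:
    """Decode little-endian int16 PCM bytes into normalized float32 samples."""
    if len(data) % 2:
        raise ValueError("PCM16 payload must contain an even number of bytes")
    # "<i2" is little-endian signed 16-bit, independent of host byte order.
    samples = np.frombuffer(data, dtype="<i2")
    # Assumed normalization: scale so that -32768 maps to exactly -1.0.
    return samples.astype(np.float32) / np.float32(32768.0)
```

For example, the six bytes 00 00 ff 7f 00 80 decode to three samples: zero, the largest positive value (just under 1.0), and -1.0.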

The process_audio() method is the core abstract interface. It accepts a float32 numpy array, a monotonic sample offset, and the sample rate, and yields an AsyncIterator[TranscriptEvent]. A flush() method emits final events for any audio still buffered. Runtime language switching goes through set_languages(): the default is a no-op, while concrete implementations such as VllmASRBackend override it to update the ASR prompt and invalidate system-prompt caches.
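
A toy backend makes the call pattern concrete. Everything below is a sketch under stated assumptions: the TranscriptEvent field names, the ToyASR class, and the transcribe driver are hypothetical, and flush() is assumed to be an async iterator like process_audio(), which the text does not confirm.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import AsyncIterator

import numpy as np


@dataclass
class TranscriptEvent:
    # Field names are assumptions based on the description in the text.
    segment_id: int
    revision: int
    is_final: bool
    text: str


class ASRBackendSketch(ABC):
    @abstractmethod
    def process_audio(
        self, audio: np.ndarray, sample_offset: int, sample_rate: int = 16000
    ) -> AsyncIterator[TranscriptEvent]:
        """Yield transcript events for one chunk of float32 audio."""

    @abstractmethod
    def flush(self) -> AsyncIterator[TranscriptEvent]:
        """Yield final events for any buffered audio."""

    def set_languages(self, languages: list) -> None:
        """Default is a no-op; subclasses may override."""


class ToyASR(ASRBackendSketch):
    """Emits one partial event per chunk and one final event on flush."""

    def __init__(self) -> None:
        self._buffer_samples = 0
        self._segment_id = 0
        self._revision = 0

    async def process_audio(self, audio, sample_offset, sample_rate=16000):
        self._buffer_samples += len(audio)
        self._revision += 1
        yield TranscriptEvent(
            self._segment_id, self._revision, False,
            f"{self._buffer_samples} samples buffered",
        )

    async def flush(self):
        if self._buffer_samples:
            yield TranscriptEvent(
                self._segment_id, self._revision + 1, True,
                f"{self._buffer_samples} samples finalized",
            )
            self._segment_id += 1
            self._buffer_samples = 0
            self._revision = 0


async def transcribe(chunks: list) -> list:
    """Feed chunks with a running sample offset, then flush."""
    asr, events, offset = ToyASR(), [], 0
    for chunk in chunks:
        async for event in asr.process_audio(chunk, offset):
            events.append(event)
        offset += len(chunk)
    async for event in asr.flush():
        events.append(event)
    return events
```

The driver advances the offset by the chunk length after each call, so offsets never decrease, and flush() is what turns the last partial revision into a final event.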

DiarizeBackend assigns cluster_id values to audio segments. This is distinct from speaker identification, which maps clusters to enrolled identities. Audio is processed via process_audio(), which returns a list of SpeakerAttribution objects and may include multiple entries for overlapping speech. Speakers are enrolled using enroll_speaker(), which takes a display name and reference audio (3-10 seconds recommended) and returns an enrollment ID string.
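
The split between clustering and identification can be sketched as a separate mapping step. The SpeakerAttribution fields other than cluster_id, the enrolled dictionary, and both helper functions are hypothetical; the 3-10 second bounds come from the recommendation in the text.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class SpeakerAttribution:
    cluster_id: int
    start_sample: int          # assumed field
    end_sample: int            # assumed field
    identity: Optional[str] = None  # assumed field, filled in after enrollment lookup


def enrollment_length_ok(
    num_samples: int, sample_rate: int = 16000,
    min_s: float = 3.0, max_s: float = 10.0,
) -> bool:
    """Check reference audio against the recommended 3-10 second range."""
    return min_s <= num_samples / sample_rate <= max_s


def attach_identities(attributions: list, enrolled: dict) -> list:
    """Map cluster IDs to enrolled names; unknown clusters keep identity=None."""
    return [replace(a, identity=enrolled.get(a.cluster_id)) for a in attributions]
```

Two attributions covering overlapping sample ranges simply appear as two list entries, each resolved independently.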

The TranslateBackend handles text translation between Japanese and English. Its translate() method accepts source text, language codes, and optional prior context. The prior_context argument passes a rolling window of earlier (source_text, translation) tuples to the system prompt, helping the model anchor on running topics and avoid hallucinating full sentences from fragmented utterances. The meeting_id argument tags JSONL rows for validation harness attribution. The method raises TimeoutError if translation exceeds the configured timeout.
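
A rolling context window and the timeout behavior might look like the sketch below. The prompt wording, the window size of 3, and the helper names (make_context_window, build_system_prompt, translate_with_timeout, call_model) are all hypothetical; the source specifies only the tuple shape of prior_context and the TimeoutError.

```python
import asyncio
from collections import deque
from typing import Awaitable, Callable, Iterable, Tuple

ContextPair = Tuple[str, str]  # (source_text, translation)


def make_context_window(max_pairs: int = 3) -> deque:
    """Rolling window: appending beyond max_pairs drops the oldest pair."""
    return deque(maxlen=max_pairs)


def build_system_prompt(
    source_lang: str, target_lang: str, prior_context: Iterable[ContextPair]
) -> str:
    """Assemble an illustrative system prompt; the wording is an assumption."""
    lines = [f"Translate from {source_lang} to {target_lang}."]
    pairs = list(prior_context)
    if pairs:
        lines.append("Earlier turns, for context only:")
        lines.extend(f"{src} => {tgt}" for src, tgt in pairs)
    return "\n".join(lines)


async def translate_with_timeout(
    call_model: Callable[[str, str], Awaitable[str]],
    text: str,
    source_lang: str,
    target_lang: str,
    prior_context: Iterable[ContextPair] = (),
    timeout_s: float = 5.0,
) -> str:
    """Run a model call and raise the builtin TimeoutError if it is too slow."""
    prompt = build_system_prompt(source_lang, target_lang, prior_context)
    try:
        return await asyncio.wait_for(call_model(prompt, text), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Before Python 3.11 asyncio.TimeoutError is not the builtin class.
        raise TimeoutError(f"translation exceeded {timeout_s}s") from None
```

Here call_model stands in for whatever client the concrete backend uses; keeping it as a parameter lets the timeout logic be exercised without a model server.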

The TTSBackend synthesizes speech from text and supports voice cloning via optional reference audio. The synthesize() method takes text, a target language, and an optional voice_reference numpy array (3 seconds recommended) used to clone a speaker’s voice. It returns float32 audio samples at the specified sample rate.
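
The call shape of synthesize() can be illustrated with a stand-in backend that returns silence. The class names, the 24000 Hz sample rate, the parameter names other than voice_reference, and the assumption that synthesize() is a coroutine are not taken from the source.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Optional

import numpy as np


class TTSBackendSketch(ABC):
    sample_rate: int = 24000  # assumed value, for illustration only

    @abstractmethod
    async def synthesize(
        self,
        text: str,
        target_language: str,
        voice_reference: Optional[np.ndarray] = None,
    ) -> np.ndarray:
        """Return float32 samples at self.sample_rate."""


class SilenceTTS(TTSBackendSketch):
    """Stand-in backend: emits 50 ms of silence per input character."""

    async def synthesize(self, text, target_language, voice_reference=None):
        if voice_reference is not None and voice_reference.ndim != 1:
            raise ValueError("voice_reference must be a 1-D mono array")
        # A real backend would condition on voice_reference here.
        num_samples = len(text) * self.sample_rate // 20
        return np.zeros(num_samples, dtype=np.float32)
```

A caller would pass roughly three seconds of the speaker's audio as voice_reference when cloning is wanted, and omit it otherwise.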

Live Transcript Event Flow and Queue Management

The pipeline manages data flow through asynchronous iterators and internal buffers, ensuring monotonic sample tracking across stages. Audio enters via process_audio_bytes(), which converts PCM to float32 and passes it to process_audio(). The ASRBackend uses its internal buffers (_buffer, _buffer_samples) to accumulate audio chunks before yielding TranscriptEvent objects. These events are then dispatched through the backend’s internal pipeline.
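
Queue-based dispatch of events from an async iterator can be sketched with asyncio.Queue. The event shape (a tuple of sample offset and chunk length), the None end-of-stream sentinel, and all function names are assumptions for illustration; the source does not describe the queue's actual contents.

```python
import asyncio
from typing import AsyncIterator


async def fake_asr_events(chunk_sizes: list) -> AsyncIterator[tuple]:
    """Yield (sample_offset, num_samples) with a monotonically growing offset."""
    offset = 0
    for num_samples in chunk_sizes:
        yield (offset, num_samples)
        offset += num_samples


async def drain_into_queue(events: AsyncIterator[tuple], queue: asyncio.Queue) -> None:
    """Producer: forward every event, then signal end-of-stream with None."""
    async for event in events:
        await queue.put(event)
    await queue.put(None)


async def consume(queue: asyncio.Queue) -> list:
    """Consumer: collect events until the sentinel arrives."""
    seen = []
    while True:
        item = await queue.get()
        if item is None:
            return seen
        seen.append(item)


async def run_pipeline(chunk_sizes: list) -> list:
    # A bounded queue applies backpressure if the consumer falls behind.
    queue: asyncio.Queue = asyncio.Queue(maxsize=8)
    producer = asyncio.create_task(
        drain_into_queue(fake_asr_events(chunk_sizes), queue)
    )
    seen = await consume(queue)
    await producer
    return seen
```

Each offset equals the sum of all earlier chunk lengths, which is what keeps downstream stages aligned on the same timeline.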

Diarization runs either in parallel with ASR or as a subsequent stage: DiarizeBackend.process_audio() returns SpeakerAttribution lists that are merged with the ASR results. Translation and TTS operate on the resulting text, with translation using context windows to maintain coherence. The entire flow is driven by async iterators, so audio chunks are processed and events are yielded without blocking.
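
The source does not say how attributions are merged with ASR results. One plausible strategy, shown here purely as an assumption, is to give each transcript segment the cluster whose time range overlaps it most. The Segment class, the sample-based fields, and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    text: str
    start_sample: int
    end_sample: int


@dataclass
class SpeakerAttribution:
    cluster_id: int
    start_sample: int  # assumed field
    end_sample: int    # assumed field


def overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Number of samples shared by two half-open ranges."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def assign_cluster(segment: Segment, attributions: list) -> Optional[int]:
    """Pick the cluster with the largest overlap, or None if nothing overlaps."""
    best_id, best_shared = None, 0
    for attribution in attributions:
        shared = overlap(
            segment.start_sample, segment.end_sample,
            attribution.start_sample, attribution.end_sample,
        )
        if shared > best_shared:
            best_id, best_shared = attribution.cluster_id, shared
    return best_id
```

With overlapping speech, this picks a single dominant speaker per segment; a pipeline that needs every speaker would keep all attributions with non-zero overlap instead.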