meetingscribe

meetingscribe is a real-time multilingual meeting transcription system designed for the Dell Pro Max with GB10. It orchestrates local hardware, AI backends that run locally as separate services, and real-time WebSocket streams to transcribe and translate between any pair of 20 supported languages, and to synthesize interpretation audio for the 10 of those languages that are TTS-capable. The system identifies speakers via diarization, streams interpretation audio to guests over a local WiFi hotspot, and records the full meeting with time-aligned audio and bilingual transcript views.

| Subsystem | Description |
|-----------|-------------|
| FastAPI Server | Central orchestrator managing real-time transcription, translation, and TTS pipelines. |
| ASR Backend | Qwen3-ASR-1.7B model running on vLLM for real-time multilingual transcription. |
| Translation Backend | Qwen3.6-35B-A3B-FP8 model on vLLM for multilingual translation and name extraction. |
| TTS Backend | Qwen3-TTS-12Hz-0.6B-Base model using faster-qwen3-tts for synthesized interpretation audio. |
| Diarization Backend | pyannote.audio 4.0.4 for speaker identification with optional voice enrollment. |
| PipeWire Audio | Server-side mic capture and local playback routing for Poly room devices and headsets. |
| WiFi Hotspot | Local WiFi AP with captive portal for guest device access and interpretation audio streaming. |

README.md L1-120 (showing 40 of 120)

# Meeting Scribe

> **Disclaimer: unofficial and unsupported.** Provided for testing and
> evaluation only, on an "AS IS" basis, with no warranty and no support. Not
> affiliated with or endorsed by Dell. See [DISCLAIMER.md](DISCLAIMER.md).

Real-time multilingual meeting transcription running on the Dell Pro Max with GB10. Transcribes and translates live speech between any pair of 20 supported languages (10 of them TTS-capable for synthesized interpretation audio), identifies speakers via diarization, streams the interpretation audio track to guests over a local WiFi hotspot, and records the full meeting.

Live demo + technical details: <https://sddcinfo.github.io/meetingscribe/>

## Model Stack

All models run locally on a single GB10 node (aarch64 Linux, 128 GB unified memory, CUDA 13.0) - no cloud dependency:

| Component | Model | Backend | Default port |
|-----------|-------|---------|------|
| ASR | Qwen3-ASR-1.7B | vLLM | 8003 |
| Translation | Qwen3.6-35B-A3B-FP8 | vLLM | 8010 |
| TTS (interpretation audio) | Qwen3-TTS-12Hz-0.6B-Base | faster-qwen3-tts | 8002 |
| Diarization | pyannote.audio 4.0.4 (`speaker-diarization-community-1`) | Custom container | 8001 |

The 35B FP8 translation model is the heaviest component (≈35 GB VRAM when loaded + KV cache). Combined footprint runs ≈43 GB, well inside the GB10's 128 GB unified pool. The translate vLLM endpoint is also the primary sharing point if you run [autosre](https://github.com/sddcinfo/autosre) on the same box - both point at `:8010`.
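
The memory figures above can be checked with a few lines of arithmetic. The sketch below uses only the approximate numbers quoted in this README (128 GB pool, ≈35 GB for the translation model, ≈43 GB combined); the function name and the derived "other backends" figure are illustrative and not part of the project.

```python
# Approximate figures quoted in the README, in GB.
UNIFIED_MEMORY_GB = 128   # GB10 unified memory pool
TRANSLATION_GB = 35       # translation model footprint as quoted (approximate)
COMBINED_GB = 43          # all model backends together (approximate)


def memory_budget(total_gb: float, combined_gb: float, heaviest_gb: float) -> dict:
    """Derive headroom and utilisation from the quoted footprints.

    Hypothetical helper for illustration; not a meetingscribe API.
    """
    return {
        # What is left of the combined footprint for ASR, TTS and diarization.
        "others_gb": combined_gb - heaviest_gb,
        # Memory still free for the OS, the server, or a co-located workload.
        "headroom_gb": total_gb - combined_gb,
        # Fraction of the unified pool used by the model stack.
        "utilisation": combined_gb / total_gb,
    }


budget = memory_budget(UNIFIED_MEMORY_GB, COMBINED_GB, TRANSLATION_GB)
print(f"other backends: ~{budget['others_gb']} GB")
print(f"headroom:       ~{budget['headroom_gb']} GB")
print(f"utilisation:    ~{budget['utilisation']:.0%}")
```

Under these figures the model stack uses roughly a third of the pool, leaving about 85 GB free. The numbers are estimates from the README, not measurements.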

Operational details for the live ASR -> translate -> TTS path, including the
GB10 TTS runtime choice and saved-meeting regression gates, are documented in
[`docs/gb10-live-stack.md`](docs/gb10-live-stack.md).
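
As a rough illustration of how the four backends in the Model Stack table are addressed, the sketch below maps each component to its default port and runs a plain TCP reachability check. The helper names and the `127.0.0.1` host are assumptions made for this example. The check only tests whether a port accepts connections and makes no assumption about each backend's HTTP API.

```python
import socket
from typing import Mapping

# Default ports from the Model Stack table.
DEFAULT_PORTS = {
    "asr": 8003,
    "translation": 8010,
    "tts": 8002,
    "diarization": 8001,
}


def backend_base_urls(host: str = "127.0.0.1",
                      ports: Mapping[str, int] = DEFAULT_PORTS) -> dict:
    """Build a base URL per backend (hypothetical helper, plain HTTP assumed)."""
    return {name: f"http://{host}:{port}" for name, port in ports.items()}


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def probe_backends(host: str = "127.0.0.1",
                   ports: Mapping[str, int] = DEFAULT_PORTS,
                   timeout: float = 0.5) -> dict:
    """Check every backend port and report which ones accept connections."""
    return {name: port_is_open(host, port, timeout) for name, port in ports.items()}


if __name__ == "__main__":
    urls = backend_base_urls()
    for name, reachable in probe_backends().items():
        status = "up" if reachable else "down"
        print(f"{name:12s} {urls[name]:24s} {status}")
```

An open port does not prove that a model has finished loading, so a check like this is only a first step before exercising the real ASR -> translate -> TTS path.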

## Features

- **Multi-language ASR** - Qwen3-ASR with 52-language support + auto-detection. Per-language quality is verified on 19 of the 20 supported languages (most reach ≤5% p50 normalized error on Fleurs). Malay (`ms`) is best-effort: Fleurs has no Malay split, so we score Indonesian (`id`) as a proxy.
- **Configurable language pairs** - Any pair from **20 supported languages**. 10 of those are TTS-capable (English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, Russian) and enable the interpretation-audio feature end-to-end. The other 10 (Dutch, Arabic, Thai, Vietnamese, Indonesian, Malay, Hindi, Turkish, Polish, Ukrainian) work for ASR + translate only - no synthesized interpretation audio. Default pair is `en,ja`; set `SCRIBE_LANGUAGE_PAIR` to change.
- **Interpretation audio** - For TTS-capable language pairs only. Near-real-time translated audio track, per-client language preference, streamed to hotspot guests over a dedicated WebSocket.
- **GB10 audio routing** - Server-side mic capture and local playback routing for the Poly room device plus private admin/headset TTS. The same controls are available before a meeting on the setup page, during a meeting in the admin controls, and from `meeting-scribe audio`.
- **Speaker diarization** - pyannote-based speaker identification with optional voice enrollment
- **1:1 conversation mode** - Full-screen split for 2-person bilingual conversations
- **Metrics dashboard** - Split-view real-time performance stats (memory, ASR, translation latency)
- **Bilingual transcript view** - Every utterance shown in its original language side-by-side with the translation
- **Slide translation** - Upload a PPTX; the deck is translated slide-by-slide and rendered progressively during the meeting (adaptive batching + concurrent LibreOffice renders for fast first-paint)
- **Audio recording** - Time-aligned PCM with segment-level playback and a podcast-style player
- **Room setup** - Drag-and-drop table/seat editor with 8 presets
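
The language-pair rules in the feature list can be sketched as a small validation step. This is a minimal illustration, not the project's actual parser. The README confirms only the codes `en`, `ja`, `ms`, and `id`; the remaining two-letter codes are assumed ISO 639-1 values, and the function names are hypothetical. The README also does not state whether both languages in a pair must be TTS-capable, so the sketch reports capability per language.

```python
import os

# The 10 TTS-capable languages listed above (codes other than en/ja are assumed).
TTS_CAPABLE = {"en", "zh", "ja", "ko", "fr", "de", "es", "it", "pt", "ru"}
# The 10 ASR + translate only languages (codes other than id/ms are assumed).
ASR_TRANSLATE_ONLY = {"nl", "ar", "th", "vi", "id", "ms", "hi", "tr", "pl", "uk"}
SUPPORTED = TTS_CAPABLE | ASR_TRANSLATE_ONLY


def parse_language_pair(value: str) -> tuple:
    """Parse a value such as 'en,ja' into two distinct supported codes."""
    codes = [code.strip().lower() for code in value.split(",")]
    if len(codes) != 2 or codes[0] == codes[1]:
        raise ValueError(f"expected two distinct language codes, got {value!r}")
    unknown = [code for code in codes if code not in SUPPORTED]
    if unknown:
        raise ValueError(f"unsupported language code(s): {', '.join(unknown)}")
    return codes[0], codes[1]


def tts_capability(pair: tuple) -> dict:
    """Report, per language, whether interpretation audio could be synthesized."""
    return {code: code in TTS_CAPABLE for code in pair}


if __name__ == "__main__":
    # SCRIBE_LANGUAGE_PAIR and its default of 'en,ja' come from the README.
    pair = parse_language_pair(os.environ.get("SCRIBE_LANGUAGE_PAIR", "en,ja"))
    print(pair, tts_capability(pair))
```

For example, `en,ms` parses as a valid pair, but only `en` is reported as TTS-capable, which matches the note that Malay works for ASR + translate only.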