Slide Processing

The slide processing pipeline in meetingscribe ingests, converts, translates, and renders PowerPoint presentations through a multi-phase background worker. It prioritizes perceived performance by splitting the work into a fast validation phase and slower rendering and extraction phases, so the deck can be activated immediately after validation. Blocking work such as LibreOffice conversion runs in background threads via asyncio.to_thread(), which keeps the event loop responsive and lets text extraction and translation proceed in parallel with the slower image rendering.

Pipeline Overview and Orchestration

The pipeline is defined in the src/meeting_scribe/slides module, which serves as the entry point for slide upload, translation, and rendering logic ¹. The core orchestration happens in worker.py, which manages the lifecycle of a slide deck through distinct stages tracked in a meta.json file. The worker checks for the availability of LibreOffice via check_worker_available() before proceeding ². The processing is split into phases to optimize latency: Phase 1 validates the file, Phase 2 extracts text and renders originals, and Phase 3 handles translation reinsertion and final rendering.
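
The stage tracking described above can be sketched as follows. The stage and status keys mirror the meta.json updates visible in the worker.py excerpt, but atomic_write_json_sketch and set_stage are hypothetical names, and the temp-file-plus-rename approach is only an assumption about how the real atomic_write_json helper might behave; its implementation is not shown in the source.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json_sketch(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it
    over the target so readers never observe a half-written file.
    (Assumed technique; the real helper is not shown in the source.)"""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, ensure_ascii=False, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_name, path)  # atomic rename within one filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise


def set_stage(meta_path: Path, stage: str, status: str) -> dict:
    """Record the current stage and its status, preserving earlier stages."""
    meta = json.loads(meta_path.read_text()) if meta_path.exists() else {}
    meta["stage"] = stage
    meta.setdefault("stages", {})[stage] = {"status": status}
    atomic_write_json_sketch(meta_path, meta)
    return meta
```

Writing through a rename means a route that reads meta.json while the worker is updating it sees either the old document or the new one, never a partial write.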

Phase 1: Fast Validation

The first phase is designed to complete in under one second, so the system can give the user immediate feedback. The run_validate function executes _validate_sync in a background thread using asyncio.to_thread(). This synchronous function writes the uploaded bytes to _upload.pptx and calls validate_pptx_contents to verify the file’s integrity. If validation fails, meta.json is updated with a FAILED status and the error message, and a RuntimeError is raised to halt the pipeline. On success, the slide count is recorded in the metadata and the stage is marked as DONE.
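
A minimal sketch of this phase, under stated assumptions: count_slides is a hypothetical stand-in for validate_pptx_contents (it only checks that the upload is a zip archive and counts slide parts), and the "validating" stage name, the lowercase status strings, and the total_slides key are illustrative rather than taken from the real models.

```python
import asyncio
import io
import json
import tempfile
import zipfile
from pathlib import Path


def count_slides(pptx_path: Path) -> int:
    """Hypothetical validator: a PPTX is a zip archive, so count its
    ppt/slides/slideN.xml members."""
    if not zipfile.is_zipfile(pptx_path):
        raise ValueError("not a valid PPTX archive")
    with zipfile.ZipFile(pptx_path) as zf:
        return sum(
            1
            for name in zf.namelist()
            if name.startswith("ppt/slides/slide") and name.endswith(".xml")
        )


def validate_sync(data: bytes, output_dir: Path) -> int:
    """Blocking part: persist the upload, validate it, record the outcome."""
    input_path = output_dir / "_upload.pptx"
    meta_path = output_dir / "meta.json"
    input_path.write_bytes(data)
    meta: dict = {"stage": "validating", "stages": {}}
    try:
        total = count_slides(input_path)
        if total == 0:
            raise ValueError("deck contains no slides")
    except Exception as exc:
        # Record the failure before halting the pipeline.
        meta["stages"]["validating"] = {"status": "failed", "error": str(exc)}
        meta_path.write_text(json.dumps(meta))
        raise RuntimeError(f"validation failed: {exc}") from exc
    meta["total_slides"] = total
    meta["stages"]["validating"] = {"status": "done"}
    meta_path.write_text(json.dumps(meta))
    return total


async def run_validate_sketch(data: bytes, output_dir: Path) -> int:
    """Async wrapper: the event loop stays free while the thread works."""
    return await asyncio.to_thread(validate_sync, data, output_dir)


def make_sample_deck(slide_count: int) -> bytes:
    """Build a tiny zip that looks enough like a PPTX for this sketch."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(1, slide_count + 1):
            zf.writestr(f"ppt/slides/slide{i}.xml", "<sld/>")
    return buf.getvalue()
```

Calling asyncio.run(run_validate_sketch(make_sample_deck(2), some_dir)) returns the slide count and leaves a meta.json that a caller could use to activate the deck before any rendering has started.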

Phase 2: Text Extraction and Original Rendering

Phase 2 is split into two tasks, run_extract_text and run_render_originals, which are intended to run concurrently rather than sequentially. Total processing time is then governed by the slower task (rendering) instead of the sum of both.
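
The concurrency in this phase might look like the sketch below. The two *_sync functions and translate are hypothetical stand-ins that sleep instead of doing real work; only the pattern (start rendering as a task, await the fast extraction, begin translating while rendering is still running) is the point.

```python
import asyncio
import time


def extract_text_sync(events: list[str]) -> list[str]:
    """Stand-in for the fast text extraction step."""
    time.sleep(0.01)
    events.append("text ready")
    return ["Title", "Agenda"]


def render_originals_sync(events: list[str]) -> int:
    """Stand-in for the slow LibreOffice + pdftoppm rendering step."""
    time.sleep(0.2)
    events.append("originals rendered")
    return 2  # number of PNGs produced


async def translate(texts: list[str], events: list[str]) -> list[str]:
    """Stand-in for a translation service call."""
    events.append("translation started")
    return [t.upper() for t in texts]


async def process_deck() -> tuple[list[str], list[str], int]:
    events: list[str] = []
    # Kick off the slow render without waiting for it.
    render_task = asyncio.create_task(
        asyncio.to_thread(render_originals_sync, events)
    )
    # The fast extraction finishes first, so translation can start early.
    texts = await asyncio.to_thread(extract_text_sync, events)
    translated = await translate(texts, events)
    rendered = await render_task
    return events, translated, rendered


events, translated, rendered = asyncio.run(process_deck())
```

In this sketch the recorded events show translation starting before the originals finish rendering, which is the ordering a sequential pipeline cannot provide.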

Text Extraction

The run_extract_text function calls _extract_text_sync, which uses python-pptx to extract text runs and speaker notes ³. This step is fast (~1-2 seconds) and produces text_extract.json and slide_notes.json. Because it finishes long before image rendering, translation can begin on the extracted text while rendering continues in the background. Speaker-note extraction is non-fatal: if it fails, a warning is logged and a sentinel slide_notes.json is still written, so the notes API route can distinguish an extraction failure from a deck that has no notes, and from a blank slide in a deck that does have notes.
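
To illustrate why the notes file is always written, here is a sketch of how a reader of slide_notes.json could tell the cases apart. The notes and deck_has_any_notes keys come from the worker.py excerpt; the shape of the failure sentinel is not visible in the source, so the error key used here is purely hypothetical, as are the function name and the returned labels.

```python
def classify_slide_notes(payload: dict, slide_idx: int) -> str:
    """Return a label describing the notes state for one slide.

    Assumes a failure sentinel carries a truthy "error" key
    (hypothetical; the real sentinel shape is not shown).
    """
    if payload.get("error"):
        return "extraction_failed"
    if not payload.get("deck_has_any_notes"):
        return "deck_has_no_notes"
    notes = payload.get("notes", [])
    if slide_idx >= len(notes) or not notes[slide_idx].strip():
        return "slide_blank"
    return "has_notes"


# Example payloads: a successful extraction and an assumed failure sentinel.
ok_payload = {"notes": ["Welcome everyone", "  "], "deck_has_any_notes": True}
empty_payload = {"notes": ["", ""], "deck_has_any_notes": False}
failed_payload = {"notes": [], "deck_has_any_notes": False, "error": "parse error"}
```

Without a sentinel, the failed case would be indistinguishable from the empty deck, which is the silent degradation the source comments warn about.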

Original Rendering

The run_render_originals function handles the conversion of the original PPTX to images, a process that typically takes 25-30 seconds for a 50-slide deck ². It calls _render_originals_sync, which uses convert_pptx_to_images (LibreOffice + pdftoppm) to generate PNGs in the original/ directory. Progress is broadcast to the event loop via _make_thread_safe_progress, which wraps the async ProgressBroadcast callable to ensure thread safety. A legacy function run_render_and_extract exists for backward compatibility but performs these steps sequentially and is discouraged for new code.
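
One common way to bridge a worker thread and the event loop is asyncio.run_coroutine_threadsafe, sketched below. This is an assumption about how a helper like _make_thread_safe_progress could work, not its actual body (the excerpt shows only its signature). The ProgressBroadcast alias matches the source; every other name is hypothetical, and broadcast is assumed to be an async def so that calling it yields a coroutine.

```python
import asyncio
from collections.abc import Awaitable, Callable
from typing import Optional

ProgressBroadcast = Callable[[int, int], Awaitable[None]]
SyncProgress = Callable[[int, int], None]


def make_thread_safe_progress_sketch(
    loop: asyncio.AbstractEventLoop,
    broadcast: Optional[ProgressBroadcast],
) -> Optional[SyncProgress]:
    """Wrap an async broadcast so a worker thread can call it synchronously."""
    if broadcast is None:
        return None

    def on_progress(slide_idx: int, total: int) -> None:
        # Runs in the worker thread: hand the coroutine to the event loop.
        future = asyncio.run_coroutine_threadsafe(broadcast(slide_idx, total), loop)
        # Waiting keeps updates ordered and delivered in this sketch; a
        # fire-and-forget variant would simply skip this call.
        future.result(timeout=5)

    return on_progress


def render_sync(total: int, on_progress: Optional[SyncProgress]) -> None:
    """Stand-in for the blocking render loop, reporting each slide."""
    for slide_idx in range(total):
        if on_progress is not None:
            on_progress(slide_idx, total)


async def main() -> list[tuple[int, int]]:
    seen: list[tuple[int, int]] = []

    async def broadcast(slide_idx: int, total: int) -> None:
        seen.append((slide_idx, total))  # executes on the event loop

    callback = make_thread_safe_progress_sketch(asyncio.get_running_loop(), broadcast)
    await asyncio.to_thread(render_sync, 3, callback)
    return seen


progress_seen = asyncio.run(main())
```

Blocking on the future is safe here because the event loop is idle while it awaits the thread; calling the wrapper from the event loop thread itself would deadlock, so it belongs only in worker-thread code.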

Phase 3: Translation Reinsertion and Final Rendering

Once translations are available, the pipeline proceeds to reinsert the translated text into the PPTX and render the final slides. The run_reinsert function executes _run_reinsert_sync in a background thread. This function calls reinsert_translated_text to modify the PPTX, saving it as translated.pptx. Subsequently, render_translated_to_images generates the final PNGs in the translated/ directory. The pipeline marks the stage as complete and records the completed_at timestamp upon finishing.

Express-Lane Partial Rendering

To improve user experience, the system supports an “express lane” that renders only the first 1-2 translated slides quickly. The run_partial_translated_render function handles this by creating a unique scratch directory for each invocation to avoid race conditions when multiple batches are processed in parallel. It calls render_partial_translated to generate PNGs for specific slide indices, then cleans up the temporary work directory. Additionally, run_translated_pdf_only provides a post-express finalizer that generates translated.pptx and original.pdf without re-rendering PNGs, leveraging the fact that the express lane already produced the necessary images.
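
The per-invocation scratch directory can be sketched as follows. The directory prefix, the PNG file names, and the placeholder bytes are all hypothetical; the real work is done by render_partial_translated, whose interface is not shown in the source.

```python
import shutil
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def partial_render_sketch(output_dir: Path, slide_indices: list[int]) -> list[Path]:
    """Render a few slides in a private scratch dir, then publish and clean up."""
    # A fresh directory per call means parallel batches never share temp files.
    work_dir = output_dir / f"_partial_{uuid.uuid4().hex}"
    work_dir.mkdir(parents=True)
    try:
        translated_dir = output_dir / "translated"
        translated_dir.mkdir(exist_ok=True)
        produced: list[Path] = []
        for idx in slide_indices:
            scratch = work_dir / f"slide_{idx}.png"
            scratch.write_bytes(b"placeholder")  # stand-in for the real renderer
            final = translated_dir / scratch.name
            scratch.replace(final)  # move the finished PNG into place
            produced.append(final)
        return produced
    finally:
        # Remove the scratch directory whether rendering succeeded or failed.
        shutil.rmtree(work_dir, ignore_errors=True)


def run_two_batches(output_dir: Path) -> list[Path]:
    """Usage example: two express-lane batches running in parallel threads."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(partial_render_sketch, output_dir, [0])
        second = pool.submit(partial_render_sketch, output_dir, [1])
        return first.result() + second.result()
```

Using a random suffix rather than a fixed scratch path is what lets overlapping batches run without deleting or overwriting each other's intermediate files.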

src/meeting_scribe/slides/__init__.py L1-2

"""Slide upload, translation, and rendering pipeline for meeting-scribe."""

src/meeting_scribe/slides/worker.py L1-120 (showing 40 of 120)

"""PPTX processing worker - runs conversion in background threads.

Trusted environment - no Docker sandboxing. Processing runs in
asyncio.to_thread() to avoid blocking the event loop.

Split into fast (validate) and slow (render + extract) phases so the
deck can be activated immediately after validation, before rendering.
"""

from __future__ import annotations

import asyncio
import json
import logging
import uuid
from collections.abc import Awaitable, Callable
from pathlib import Path

from meeting_scribe.slides.convert import (
    convert_pptx_to_images,
    extract_notes_from_pptx,
    extract_text_from_pptx,
    reinsert_translated_text,
    render_pptx_to_pdf,
    render_translated_to_images,
    validate_pptx_contents,
    write_text_extract,
)
from meeting_scribe.slides.models import SlideMeta, StageProgress, StageStatus
from meeting_scribe.util.atomic_io import atomic_write_json

logger = logging.getLogger(__name__)

ProgressBroadcast = Callable[[int, int], Awaitable[None]]
"""Async callable invoked from the event loop with (slide_idx_0based, total)."""


def _make_thread_safe_progress(
    loop: asyncio.AbstractEventLoop,
    broadcast: ProgressBroadcast | None,

src/meeting_scribe/slides/worker.py L121-240 (showing 40 of 120)

    input_path = output_dir / "_upload.pptx"
    meta_path = output_dir / "meta.json"
    meta_dict = json.loads(meta_path.read_text()) if meta_path.exists() else {}

    meta_dict["stage"] = "extracting_text"
    meta_dict.setdefault("stages", {})["extracting_text"] = {"status": "in_progress"}
    atomic_write_json(meta_path, meta_dict)

    slides = extract_text_from_pptx(input_path)
    write_text_extract(slides, output_dir / "text_extract.json")

    # Speaker notes for the admin "translation + presentation" view.
    # Failures don't fail the whole extraction stage - notes are a
    # nice-to-have; logging tells us if it broke. Lives in
    # slide_notes.json beside text_extract.json so the API route at
    # GET /api/meetings/<id>/slides/notes can read it directly.
    # Always write the file (even on failure) so the route can
    # distinguish "extraction failed" from "deck has no notes" from
    # "this slide is blank in a notes-bearing deck".
    try:
        notes_list = extract_notes_from_pptx(input_path)
        any_notes = any(n.strip() for n in notes_list)
        (output_dir / "slide_notes.json").write_text(
            json.dumps(
                {"notes": notes_list, "deck_has_any_notes": any_notes},
                ensure_ascii=False,
                indent=2,
            )
        )
        non_empty = sum(1 for n in notes_list if n.strip())
        logger.info("Extracted speaker notes: %d / %d slides", non_empty, len(notes_list))
    except Exception as notes_exc:
        logger.warning("speaker notes extraction skipped: %s", notes_exc)
        # Persist a sentinel so the route surfaces "couldn't load notes"
        # instead of silently degrading to "no notes" - which the
        # operator otherwise can't distinguish from a genuinely blank
        # deck.
        try:
            (output_dir / "slide_notes.json").write_text(
                json.dumps(