
OOXML Manipulation

The OOXML editing primitives in ppt_craft provide a low-level, direct manipulation layer for PowerPoint files, bypassing high-level abstractions to ensure byte-equal round-trips and full feature parity with desktop PowerPoint 1. This package operates by loading the .pptx as a mutable in-memory ZIP archive, enumerating parts via [Content_Types].xml to avoid hardcoded path assumptions, and applying surgical XML mutations using lxml 2. It handles slide geometry, text content, native chart insertion, and theme properties while enforcing strict validation against geometry bounds and placeholder residue 1.

The core entry point is load(), which reads a .pptx file into a PptxPackage dataclass containing two dictionaries: parts (mapping zip paths to raw bytes) and zinfos (preserving original ZipInfo metadata like compression level and modification time) 2. This structure allows for replace() operations and to_bytes() repackaging that maintains byte-equal fidelity.
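The load / replace / to_bytes cycle described above can be sketched with the standard library alone. This is a minimal illustration, not the package's actual code: the field names follow the description, but the method bodies are assumptions, and the sketch makes no attempt to prove byte-equality of the output.

```python
import io
import zipfile
from dataclasses import dataclass, field


@dataclass
class PptxPackage:
    """In-memory view of a .pptx: raw part bytes plus the original ZipInfo records."""
    parts: dict[str, bytes] = field(default_factory=dict)
    zinfos: dict[str, zipfile.ZipInfo] = field(default_factory=dict)

    def replace(self, name: str, data: bytes) -> None:
        # Only existing parts may be swapped. Adding a part would also need
        # content-type and relationship updates, which this sketch omits.
        if name not in self.parts:
            raise KeyError(name)
        self.parts[name] = data

    def to_bytes(self) -> bytes:
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            # Dicts keep insertion order, so members are written in archive order.
            for name, data in self.parts.items():
                # Reusing the ZipInfo carries over timestamp and compression type.
                zf.writestr(self.zinfos[name], data)
        return buf.getvalue()


def load(source) -> PptxPackage:
    """Read every member of the archive at `source` (a path or a file object)."""
    pkg = PptxPackage()
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            pkg.parts[info.filename] = zf.read(info)
            pkg.zinfos[info.filename] = info
    return pkg
```

Keeping the ZipInfo objects next to the bytes is what lets the repackaged archive retain member order, timestamps, and compression type without further bookkeeping.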

Crucially, the system never assumes standard part names like theme1.xml or slideMaster1.xml. Instead, enumerate_parts() parses [Content_Types].xml to dynamically discover all themes, masters, layouts, slides, and charts. This is vital for multi-master decks where multiple theme files exist.
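A rough sketch of content-type-driven discovery follows. The content-type strings are the standard OOXML ones; the short labels ("theme", "master", and so on) and the returned dictionary shape are assumptions made for illustration, and xml.etree.ElementTree stands in for lxml so the sketch runs on the standard library.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"
_OXML = "application/vnd.openxmlformats-officedocument."

# Standard OOXML content types, keyed to short labels (labels are hypothetical).
KIND_BY_CONTENT_TYPE = {
    _OXML + "theme+xml": "theme",
    _OXML + "presentationml.slideMaster+xml": "master",
    _OXML + "presentationml.slideLayout+xml": "layout",
    _OXML + "presentationml.slide+xml": "slide",
    _OXML + "drawingml.chart+xml": "chart",
}


def enumerate_parts(content_types_xml: bytes) -> dict[str, list[str]]:
    """Group part paths by kind using the <Override> entries, not file names."""
    root = ET.fromstring(content_types_xml)
    found = defaultdict(list)
    for override in root.iter(f"{{{CT_NS}}}Override"):
        kind = KIND_BY_CONTENT_TYPE.get(override.get("ContentType"))
        if kind:
            # PartName is package-absolute ("/ppt/..."); zip members have no
            # leading slash, so strip it to get a usable key.
            found[kind].append(override.get("PartName").lstrip("/"))
    return dict(found)
```

Because the lookup keys on ContentType, a deck whose themes are named theme1.xml and theme3.xml is handled the same way as one with a single theme.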

Slide geometry is not assumed; slide_size_emu() explicitly reads <p:sldSz> from ppt/presentation.xml to determine canvas dimensions in EMUs, falling back to widescreen defaults only if the element is missing.
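As an illustration, the lookup might be written as follows. The function signature (raw bytes in, a width/height tuple out) is an assumption; the fallback is the standard 16:9 size of 12192000 x 6858000 EMU, and ElementTree is used here in place of lxml.

```python
import xml.etree.ElementTree as ET

P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
# 13.333 in x 7.5 in at 914400 EMU per inch.
WIDESCREEN_EMU = (12192000, 6858000)


def slide_size_emu(presentation_xml: bytes) -> tuple[int, int]:
    """Return (width, height) in EMU from <p:sldSz>, or the widescreen default."""
    root = ET.fromstring(presentation_xml)
    sld_sz = root.find(f"{{{P_NS}}}sldSz")  # direct child of <p:presentation>
    if sld_sz is None:
        return WIDESCREEN_EMU
    return int(sld_sz.get("cx")), int(sld_sz.get("cy"))
```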

Slide editing is handled in slide_edit.py, which operates directly on lxml roots in-place 3. Shapes are located by their id attribute via find_shape_by_id(), searching <p:sp>, <p:graphicFrame>, and <p:pic> elements.
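In PresentationML the numeric id sits on the <p:cNvPr> element inside each shape's non-visual properties block, so a lookup along these lines is one plausible reading of the description. The body below is an assumption, written against ElementTree rather than lxml.

```python
import xml.etree.ElementTree as ET

P = "http://schemas.openxmlformats.org/presentationml/2006/main"
SHAPE_TAGS = {f"{{{P}}}sp", f"{{{P}}}graphicFrame", f"{{{P}}}pic"}


def find_shape_by_id(slide_root, shape_id):
    """Return the first sp / graphicFrame / pic whose cNvPr id matches, else None."""
    wanted = str(shape_id)
    for el in slide_root.iter():
        if el.tag not in SHAPE_TAGS:
            continue
        # The id lives one level down: nvSpPr, nvGraphicFramePr or nvPicPr,
        # each of which holds a <p:cNvPr id="..." name="..."/> child.
        c_nv_pr = el.find(f"./*/{{{P}}}cNvPr")
        if c_nv_pr is not None and c_nv_pr.get("id") == wanted:
            return el
    return None
```

Iterating the whole tree also reaches shapes nested inside group shapes, while the group container itself is never returned because its tag is not in SHAPE_TAGS.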

Text manipulation is granular. apply_paragraphs() replaces all paragraphs in a shape’s text body: it preserves <a:bodyPr> and <a:lstStyle>, removes the existing <a:p> elements, and builds new ones. Runs are constructed with specific formatting (bold, italic, size, color) by _build_run_xml(). For targeted updates, set_run_text() modifies a specific <a:t> element in place, addressed by paragraph and run index.
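The two text operations can be sketched as below. The run-spec format (a list of paragraphs, each a list of dicts) and the helper name build_run are hypothetical stand-ins for whatever _build_run_xml() actually accepts; the XML details (b and i flags, sz in hundredths of a point, <a:rPr> before <a:t>) follow the DrawingML schema. ElementTree replaces lxml to keep the sketch dependency-free.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
P = "http://schemas.openxmlformats.org/presentationml/2006/main"


def build_run(text, bold=False, italic=False, size_pt=None, color=None):
    """Build one <a:r>; hypothetical stand-in for _build_run_xml()."""
    attrs = {}
    if bold:
        attrs["b"] = "1"
    if italic:
        attrs["i"] = "1"
    if size_pt is not None:
        attrs["sz"] = str(int(round(size_pt * 100)))  # hundredths of a point
    run = ET.Element(f"{{{A}}}r")
    rpr = ET.SubElement(run, f"{{{A}}}rPr", attrs)  # rPr must precede <a:t>
    if color is not None:
        fill = ET.SubElement(rpr, f"{{{A}}}solidFill")
        ET.SubElement(fill, f"{{{A}}}srgbClr", {"val": color})
    ET.SubElement(run, f"{{{A}}}t").text = text
    return run


def apply_paragraphs(shape, paragraphs):
    """Drop every <a:p>, keep <a:bodyPr>/<a:lstStyle>, append new paragraphs."""
    tx_body = shape.find(f"{{{P}}}txBody")
    for old in tx_body.findall(f"{{{A}}}p"):
        tx_body.remove(old)
    for runs in paragraphs:
        para = ET.SubElement(tx_body, f"{{{A}}}p")
        for spec in runs:
            para.append(build_run(**spec))


def set_run_text(shape, para_idx, run_idx, text):
    """Overwrite a single <a:t>, leaving its run formatting untouched."""
    para = shape.findall(f"{{{P}}}txBody/{{{A}}}p")[para_idx]
    para.findall(f"{{{A}}}r")[run_idx].find(f"{{{A}}}t").text = text
```

Appending the new paragraphs at the end of <p:txBody> keeps them after <a:bodyPr> and <a:lstStyle>, which is the order the schema expects.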

Geometry manipulation is achieved via move_shape(), which sets <a:off> (position) and <a:ext> (size) attributes within the shape’s <a:xfrm> element. shape_bbox_emu() retrieves the current bounding box (x, y, cx, cy) for layout calculations.
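A minimal sketch of the two geometry helpers, assuming the shape carries its own <a:xfrm> (placeholders that inherit their position from the layout have none, and graphic frames use <p:xfrm> instead; neither case is handled here). The return and error conventions are assumptions, and ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"


def shape_bbox_emu(shape):
    """Return (x, y, cx, cy) in EMU, or None if the shape has no own <a:xfrm>."""
    xfrm = shape.find(f".//{{{A}}}xfrm")
    if xfrm is None:
        return None
    off = xfrm.find(f"{{{A}}}off")
    ext = xfrm.find(f"{{{A}}}ext")
    return (int(off.get("x")), int(off.get("y")),
            int(ext.get("cx")), int(ext.get("cy")))


def move_shape(shape, x, y, cx, cy):
    """Write position into <a:off> and size into <a:ext>, in place."""
    xfrm = shape.find(f".//{{{A}}}xfrm")
    if xfrm is None:
        raise ValueError("shape has no <a:xfrm> to update")
    off = xfrm.find(f"{{{A}}}off")
    ext = xfrm.find(f"{{{A}}}ext")
    off.set("x", str(int(x)))
    off.set("y", str(int(y)))
    ext.set("cx", str(int(cx)))
    ext.set("cy", str(int(cy)))
```

Coordinates are integers in EMU (914400 per inch), so a one-inch left margin is x=914400.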

Chart handling in charts.py delegates to python-pptx rather than hand-rolling XML 4. Bespoke lxml generation fails to produce the embedded Excel workbook and the c:externalData link that the “Edit Data” feature in desktop PowerPoint requires.

add_chart_to_slide() maps compact chart types (e.g., “bar”, “scatter”) to python-pptx enums and inserts the chart at specified EMU coordinates. It supports scatter plots using XyChartData and other types using CategoryChartData.

After insertion, _recolour_series() makes a second pass over the chart’s XML, writing theme palette colors directly into the <a:solidFill> element of each <c:ser>. This bypasses python-pptx’s high-level formatting API, which may drop styling for certain chart types.
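The recolouring pass can be approximated on a bare chart tree as follows. This is a sketch under assumptions: colors are assigned by series position and cycle when the palette is shorter than the series list, any existing fill is replaced, and ElementTree is used instead of the lxml element that python-pptx exposes.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
C = "http://schemas.openxmlformats.org/drawingml/2006/chart"

# Children that the schema places before <c:spPr> inside a series.
LEADING_TAGS = {f"{{{C}}}idx", f"{{{C}}}order", f"{{{C}}}tx"}
FILL_TAGS = {f"{{{A}}}{n}" for n in ("noFill", "solidFill", "gradFill", "pattFill")}


def recolour_series(chart_root, palette):
    """Give each <c:ser> a solid sRGB fill taken from `palette` (hex strings)."""
    for idx, ser in enumerate(list(chart_root.iter(f"{{{C}}}ser"))):
        sp_pr = ser.find(f"{{{C}}}spPr")
        if sp_pr is None:
            sp_pr = ET.Element(f"{{{C}}}spPr")
            # Insert right after idx/order/tx so the element order stays valid.
            lead = [i for i, child in enumerate(ser) if child.tag in LEADING_TAGS]
            ser.insert(lead[-1] + 1 if lead else 0, sp_pr)
        for old in [child for child in sp_pr if child.tag in FILL_TAGS]:
            sp_pr.remove(old)
        fill = ET.Element(f"{{{A}}}solidFill")
        ET.SubElement(fill, f"{{{A}}}srgbClr", {"val": palette[idx % len(palette)]})
        # Assumes the chart spPr has no xfrm/geometry children ahead of the fill.
        sp_pr.insert(0, fill)
```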

Theme editing in theme_edit.py ensures that palette and font changes are applied to all theme parts in the deck, not just theme1.xml 5. This addresses a previous failure mode in which multi-master decks were only partially updated.

set_palette() iterates through all theme parts found by enumerate_parts(), locating <a:clrScheme> and rewriting slot values (e.g., “dk1”, “accent1”) to the provided hex palette. It normalizes existing color definitions to <a:srgbClr>.
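A sketch of the palette rewrite, operating on already-parsed theme roots. The signature (a list of roots plus a slot-to-hex mapping, returning a count of rewritten slots) is an assumption for illustration; the slot names and the <a:clrScheme> structure come from the DrawingML theme schema. ElementTree stands in for lxml.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"


def set_palette(theme_roots, palette):
    """Rewrite colour-scheme slots in every theme; return how many were changed.

    `palette` maps slot names such as "dk1" or "accent1" to hex strings.
    """
    changed = 0
    for root in theme_roots:
        scheme = root.find(f".//{{{A}}}clrScheme")
        if scheme is None:
            continue
        for slot, hex_color in palette.items():
            slot_el = scheme.find(f"{{{A}}}{slot}")
            if slot_el is None:
                continue  # unknown slot name: leave the theme untouched
            # Normalise: whatever was there (sysClr, srgbClr, ...) becomes srgbClr.
            for child in list(slot_el):
                slot_el.remove(child)
            ET.SubElement(slot_el, f"{{{A}}}srgbClr",
                          {"val": hex_color.lstrip("#").upper()})
            changed += 1
    return changed
```

Passing every theme root found during part enumeration, rather than a single hardcoded one, is what keeps multi-master decks consistent.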

set_fonts() similarly iterates through theme parts to update <a:fontScheme>, modifying <a:latin> typeface attributes for both major (heading) and minor (body) fonts.
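The font update follows the same pattern. In this sketch the parameter names major and minor are assumptions, and only the <a:latin> typeface is touched, matching the description; East Asian and complex-script entries are left as they are. ElementTree is used in place of lxml.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"


def set_fonts(theme_roots, major=None, minor=None):
    """Set heading (major) and body (minor) Latin typefaces in every theme."""
    for root in theme_roots:
        font_scheme = root.find(f".//{{{A}}}fontScheme")
        if font_scheme is None:
            continue
        for group, typeface in (("majorFont", major), ("minorFont", minor)):
            if typeface is None:
                continue  # caller asked to leave this font unchanged
            latin = font_scheme.find(f"{{{A}}}{group}/{{{A}}}latin")
            if latin is not None:
                latin.set("typeface", typeface)
```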

The validate.py module provides geometry and placeholder validation without requiring a LibreOffice round-trip 6. It uses the deck-specific slide size from slide_size_emu() as the boundary for all checks.

validate_pptx() iterates through all slides and shapes, checking for:

  1. Placeholder Residue: Text matching patterns like “click to add” or “lorem ipsum”.
  2. Out-of-Bounds Shapes: Shapes whose bounding box extends beyond the slide canvas dimensions.
  3. Shape Overlaps: Shapes intersecting by more than 5% of the smaller shape’s area.

It returns a list of ValidationIssue objects detailing the code, slide ID, shape ID, message, and fix hints.
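The three checks can be illustrated on plain bounding boxes. Everything below is a sketch under assumptions: the Shape record, the issue codes, the hint wording, and the per-slide function name validate_slide are hypothetical, and only the two placeholder patterns quoted above are matched. The overlap rule follows the description: intersection area divided by the smaller shape's area, flagged above 5%.

```python
import re
from dataclasses import dataclass

PLACEHOLDER_RE = re.compile(r"click to add|lorem ipsum", re.IGNORECASE)
OVERLAP_THRESHOLD = 0.05  # fraction of the smaller shape's area


@dataclass
class ValidationIssue:
    code: str
    slide_id: str
    shape_id: str
    message: str
    fix_hint: str


@dataclass
class Shape:  # hypothetical flattened record: id, visible text, bbox in EMU
    shape_id: str
    text: str
    bbox: tuple[int, int, int, int]  # x, y, cx, cy


def overlap_ratio(a, b):
    """Intersection area divided by the area of the smaller box."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    smaller = min(aw * ah, bw * bh)
    if w <= 0 or h <= 0 or smaller <= 0:
        return 0.0
    return (w * h) / smaller


def validate_slide(slide_id, shapes, slide_w, slide_h):
    issues = []
    for s in shapes:
        if PLACEHOLDER_RE.search(s.text):
            issues.append(ValidationIssue(
                "PLACEHOLDER_RESIDUE", slide_id, s.shape_id,
                f"placeholder text left in shape: {s.text!r}",
                "replace the text or delete the shape"))
        x, y, cx, cy = s.bbox
        if x < 0 or y < 0 or x + cx > slide_w or y + cy > slide_h:
            issues.append(ValidationIssue(
                "OUT_OF_BOUNDS", slide_id, s.shape_id,
                "bounding box extends past the slide canvas",
                "move or shrink the shape"))
    for i, a in enumerate(shapes):
        for b in shapes[i + 1:]:
            if overlap_ratio(a.bbox, b.bbox) > OVERLAP_THRESHOLD:
                issues.append(ValidationIssue(
                    "SHAPE_OVERLAP", slide_id, f"{a.shape_id},{b.shape_id}",
                    "shapes overlap by more than 5% of the smaller shape",
                    "separate the shapes or resize one of them"))
    return issues
```

The slide_w and slide_h arguments would come from the deck's own slide size rather than a fixed constant, so a 4:3 deck is checked against 4:3 bounds.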