Workflow & State Management
The durable workflow registry in autosre persists active workflow definitions and swap coordination markers to workflows.json on disk, so that state survives process crashes and power loss. The registry is protected by a file lock (.workflows.lock) to prevent concurrent corruption, and writes follow an atomic, durable sequence: write to a temporary file, fsync it, rename it into place, then fsync the parent directory. A critical component is the swap_in_progress marker, which coordinates model swaps by recording the initiating process's identity. The marker is reconciled automatically: if the holder process dies or no longer holds the associated .swap.lock, the orphaned marker is cleared.
Durable State Storage and Atomic Writes
The core state file is workflows.json, located in the data directory. It contains two primary structures: a list of active Workflow entries and an optional SwapInProgress marker. All mutations to this file are guarded by the .workflows.lock flock.
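As a rough illustration of guarding mutations with the .workflows.lock flock, the sketch below wraps the lock in a context manager. The name registry_lock and its data_dir parameter are assumptions made for this example, not names taken from autosre.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def registry_lock(data_dir: str):
    """Hold an exclusive flock on .workflows.lock for the duration of the block.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    lock_path = os.path.join(data_dir, ".workflows.lock")
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        yield fd
    finally:
        # Closing the descriptor also releases the flock.
        os.close(fd)
```

A caller would read, modify, and rewrite workflows.json inside the `with registry_lock(data_dir):` block so that no other process can interleave its own mutation.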
Writes to the registry are performed via atomic_durable_write, which ensures durability across power loss. This function writes data to a sibling temporary file, performs an fsync on that file, renames it to the target path, and finally performs an fsync on the parent directory. The parent directory fsync is critical because it makes the directory entry update durable: without it, a crash could lose the rename even though the new file contents had already reached disk. This helper is lock-free; callers are responsible for acquiring the appropriate flock before calling it.
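The write sequence described above can be sketched as follows. The function name matches the one in the text, but the signature (a path and a bytes payload) and the temporary file naming are assumptions for this example; the real helper may differ.

```python
import os
import tempfile


def atomic_durable_write(path: str, data: bytes) -> None:
    """Write data to path so that it survives a crash or power loss.

    Lock-free by design: the caller is expected to hold the relevant flock.
    The signature is an assumption made for this sketch.
    """
    parent = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # 1. file contents reach stable storage
        os.replace(tmp_path, path)  # 2. atomic rename over the target
    except BaseException:
        # Best-effort cleanup of the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # 3. fsync the parent directory so the new directory entry is durable.
    dir_fd = os.open(parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Readers of the target path see either the old contents or the new contents in full, never a partially written file, because the rename is the only step that changes what the path refers to.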
Workflow Registry and Lifecycle
The Workflow dataclass represents an active workflow, tracking its name, required recipe, owner process identity (pid and start_ns), and timing information (started_at, last_renewed_at, ttl_seconds). The registry supports registering, renewing, and releasing workflows.
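A minimal sketch of what such a dataclass might look like is shown below. The field names started_at, last_renewed_at, ttl_seconds, owner_start_ns, and preempt_signal_path appear in this document; required_recipe, owner_pid, preemptible, the field types, the use of Unix seconds for timestamps, and the is_expired and renew helpers are assumptions for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Workflow:
    name: str
    required_recipe: str        # assumed field name
    owner_pid: int              # assumed field name
    owner_start_ns: int
    started_at: float           # assumed to be Unix seconds
    last_renewed_at: float      # assumed to be Unix seconds
    ttl_seconds: float
    preemptible: bool = False   # assumed field name
    preempt_signal_path: Optional[str] = None

    def is_expired(self, now: Optional[float] = None) -> bool:
        """True once more than ttl_seconds have passed since the last renewal."""
        now = time.time() if now is None else now
        return now - self.last_renewed_at > self.ttl_seconds

    def renew(self, now: Optional[float] = None) -> None:
        """Reset the TTL window by recording a new renewal time."""
        self.last_renewed_at = time.time() if now is None else now
```

Measuring expiry from last_renewed_at rather than started_at is a design assumption here: it lets a long-running workflow stay registered as long as it keeps renewing.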
When a workflow is registered, the system checks for any live swap_in_progress marker targeting a different recipe. If such a marker exists, registration is refused with an EAgain error to prevent conflicting model swaps. If the marker targets the same recipe, registration is allowed. During registration, the system also sweeps for dead workflows (where the owner process is no longer alive or TTL has expired) and removes them, unlinking any associated preempt signal files.
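The registration rules above can be sketched over a plain in-memory state dictionary. Everything here is illustrative: the EAgain class is a stand-in for the error named in the text, the dictionary keys (swap_in_progress, target_recipe, required_recipe, workflows) are assumed, and the marker is assumed to have been reconciled already so that any marker present is live.

```python
from typing import Callable, List


class EAgain(Exception):
    """Hypothetical stand-in for the EAgain error mentioned above."""


def register(state: dict, workflow: dict,
             is_live: Callable[[dict], bool]) -> List[dict]:
    """Register workflow in state and return the entries swept as dead.

    Assumes the caller holds .workflows.lock and that any marker in state
    is live. The caller would unlink preempt signal files for the returned
    dead entries and persist state afterwards.
    """
    marker = state.get("swap_in_progress")
    if marker is not None and marker["target_recipe"] != workflow["required_recipe"]:
        # A swap to a different recipe is pending: refuse registration.
        raise EAgain("swap in progress to " + marker["target_recipe"])

    # Sweep entries whose owner is gone or whose TTL has expired.
    live, dead = [], []
    for entry in state["workflows"]:
        (live if is_live(entry) else dead).append(entry)

    live.append(workflow)
    state["workflows"] = live
    return dead
```

Passing the liveness predicate in as a parameter keeps the gate logic testable without touching /proc or the clock.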
Liveness of a workflow owner is determined by checking if the process exists and if its start time (starttime from /proc/<pid>/stat) matches the recorded owner_start_ns. This check defeats PID recycling, where a new process might reuse the same PID as a dead one.
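A Linux-only sketch of this check follows. The function names are hypothetical, and the sketch compares the raw starttime value from /proc (field 22, measured in clock ticks since boot) against a recorded value in the same unit; how autosre converts that value into owner_start_ns is not shown here and is left as an assumption.

```python
from typing import Optional


def parse_starttime(stat_text: str) -> int:
    """Extract field 22 (starttime) from the contents of /proc/<pid>/stat."""
    # comm (field 2) may contain spaces or parentheses, so split after the last ')'.
    fields = stat_text.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state), so field 22 sits at index 19.
    return int(fields[19])


def read_starttime(pid: int) -> Optional[int]:
    """Return the process start time, or None if the process does not exist."""
    try:
        with open(f"/proc/{pid}/stat", "r", encoding="ascii", errors="replace") as f:
            return parse_starttime(f.read())
    except (FileNotFoundError, ProcessLookupError):
        return None


def is_owner_alive(pid: int, recorded_start: int) -> bool:
    """True only if pid exists AND was started at the recorded time.

    A recycled PID belongs to a process with a different start time,
    so it fails the comparison.
    """
    start = read_starttime(pid)
    return start is not None and start == recorded_start
```

Splitting on the last closing parenthesis matters because a process name such as "my (odd) proc" would otherwise shift every later field.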
Swap Coordination and Orphan Reconciliation
The swap_in_progress marker ties a pending model swap to the process that initiated it. This marker includes the holder_pid and holder_start_ns of the swap coordinator. The marker is stored within workflows.json but is coordinated via the .swap.lock flock.
Because a swap process might crash or be killed without clearing the marker, the system employs automatic reconciliation. Before any gate check or read operation that encounters the marker, reconcile_swap_marker is called. This function:
- Acquires .workflows.lock.
- Checks if the holder_pid is still alive AND still holds .swap.lock.
- If the holder is dead or does not hold the lock, the marker is considered orphaned.
- The marker is cleared from workflows.json, and an orphan_marker_cleared event is appended to swap_log.jsonl.
This ensures that a crashed swap process cannot wedge the registry indefinitely. The swap coordinator must hold .swap.lock for the duration of the swap; if it releases the lock (or dies), the marker is automatically invalidated by subsequent reconciliations.
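The reconciliation steps above can be approximated as shown below. This is a sketch under several assumptions: the caller already holds .workflows.lock and persists the state afterwards, the liveness check is passed in as a callable, the log event fields other than the event name are invented for illustration, and "still holds .swap.lock" is approximated by probing whether anyone holds the lock at all (a non-blocking flock attempt), which does not prove that the holder is the recorded process.

```python
import fcntl
import json
import os
import time
from typing import Callable


def swap_lock_is_held(swap_lock_path: str) -> bool:
    """Probe .swap.lock: if we can take it without blocking, nobody holds it."""
    fd = os.open(swap_lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)


def reconcile_swap_marker(state: dict,
                          holder_alive: Callable[[int, int], bool],
                          swap_lock_path: str,
                          log_path: str) -> bool:
    """Clear an orphaned marker from state; return True if one was cleared."""
    marker = state.get("swap_in_progress")
    if marker is None:
        return False
    alive = holder_alive(marker["holder_pid"], marker["holder_start_ns"])
    if alive and swap_lock_is_held(swap_lock_path):
        return False  # holder is alive and the lock is held: marker is valid

    # Orphaned: drop the marker and record the event.
    state["swap_in_progress"] = None
    event = {
        "event": "orphan_marker_cleared",
        "holder_pid": marker["holder_pid"],  # assumed log field
        "ts": time.time(),                   # assumed log field
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")
    return True
```

Because the kernel releases a flock when its holder exits, a coordinator that crashes stops holding .swap.lock without any cleanup code running, which is what makes the probe useful.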
Preemption and Signal Files
For workflows marked as preemptible, the system supports graceful yielding during a swap. When a swap is initiated, the system identifies conflicting workflows (those requiring a different recipe). For preemptible conflicts, a signal file is written to the path specified by preempt_signal_path in the workflow registry.
Preemptible workflows poll for the existence of this signal file at safe checkpoints. If the file exists, the workflow yields gracefully. The signal file is created by request_preempt and removed by clear_preempt_signal when the workflow yields. If a workflow is removed from the registry (e.g., due to expiration), any associated preempt signal file is also cleaned up to prevent stale signals from affecting future registrations.
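The polling pattern might look like the sketch below. The names request_preempt and clear_preempt_signal come from the text, but their signatures here are assumptions, and preempt_requested and run_preemptible are hypothetical helpers added for illustration.

```python
import os
from pathlib import Path
from typing import Callable, Iterable


def request_preempt(signal_path: str) -> None:
    """Ask a workflow to yield by creating its signal file."""
    Path(signal_path).touch()


def preempt_requested(signal_path: str) -> bool:
    """Cheap existence check, intended to be called at safe checkpoints."""
    return os.path.exists(signal_path)


def clear_preempt_signal(signal_path: str) -> None:
    """Remove the signal file; a missing file is not an error."""
    try:
        os.unlink(signal_path)
    except FileNotFoundError:
        pass


def run_preemptible(steps: Iterable[Callable[[], None]], signal_path: str) -> int:
    """Run steps in order, yielding at the first checkpoint that sees the signal.

    Returns the number of steps completed before yielding or finishing.
    """
    done = 0
    for step in steps:
        if preempt_requested(signal_path):  # checkpoint between steps
            clear_preempt_signal(signal_path)
            break
        step()
        done += 1
    return done
```

Checking only between steps means a step is never interrupted halfway, at the cost of yielding no faster than the longest single step.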