Skip to content

Faster export pipeline with parallel IO, cached offsets, and CPU RGBA→NV12#1786

Open
richiemcilroy wants to merge 17 commits into
mainfrom
rendering-bits
Open

Faster export pipeline with parallel IO, cached offsets, and CPU RGBA→NV12#1786
richiemcilroy wants to merge 17 commits into
mainfrom
rendering-bits

Conversation

@richiemcilroy

@richiemcilroy richiemcilroy commented May 8, 2026

Copy link
Copy Markdown
Member

Improves desktop export and rendering throughput by overlapping independent setup work, reducing redundant CPU work on hot paths, and tightening a few correctness edges around FFmpeg audio/resampling and GPU readback sequencing.

Greptile Summary

This PR improves the desktop export/rendering pipeline through several independent optimisations: parallelising audio/decoder setup I/O, replacing per-frame O(n) clips.iter().find() lookups with cached index maps, swapping the FFmpeg swscale RGBA→NV12 path for a hand-written CPU implementation (with optional rayon parallelism above 1080p), and reordering GPU readback submission to overlap with waiting for the previous frame.

  • Parallel IO: create_segments and RenderVideoConstants::new are now joined with tokio::try_join!, and multiple-segment audio loading is parallelised with try_join_all and rayon::par_iter.
  • Cached clip offsets: Both render paths (render_video_to_channel / _nv12) now build a Vec<ClipOffsets> once before the frame loop, eliminating repeated clips.iter().find() calls per frame.
  • CPU RGBA→NV12: A new cpu_yuv::rgba_to_nv12_fast replaces the swscale context creation per frame, using rayon row-level parallelism above 1080p; the conversion function is also centralised to remove two separate inline copies.
  • Audio renderer: gain is now pre-converted to linear scale, the inner per-frame mixing loop is restructured for better cache locality, and timeline_cursor caches its walk state to avoid re-scanning segments each call.

Confidence Score: 4/5

Safe to merge for most platforms; the removal of the Windows-only sequential decoder init guard deserves a macOS smoke-test before shipping.

The core optimisations are well-tested with baseline-matching unit tests. Two areas warrant attention: concurrent decoder init is now unconditional on all platforms (original rationale for the Windows-only guard is undocumented), and the duplicated _into/_to_slice render methods require parallel maintenance.

crates/rendering/src/lib.rs (removed platform guard for decoder init), crates/editor/src/audio.rs (duplicated _into/_to_slice render methods)

Important Files Changed

Filename Overview
crates/audio/src/audio_data.rs Simplified resampler flush loop: all intermediate per-packet flushes removed, replaced by a single final drain. Functionally equivalent for from_file but relies on the final flush contract.
crates/audio/src/renderer.rs Refactored mixing loop for cache-locality: per-track bulk iteration replaces per-sample cross-track loop; gain field renamed to linear_gain (pre-converted); valid_output_range helper correctly handles negative track offsets; well-tested by baseline-matching tests.
crates/editor/src/audio.rs Added _to_slice variants of render methods to avoid allocations; introduced cached clip-offsets with ClipOffsets::default() fallback; timeline_cursor now caches segment walk state. Duplicated logic between _into and _to_slice paths creates maintenance risk.
crates/enc-ffmpeg/src/audio/buffered_resampler.rs Resampler is now skipped when input/output formats already match; passthrough path correctly updates min_next_pts and the top-level PTS adjustment guard still applies.
crates/export/src/mp4.rs RGBA→NV12 via swscale replaced by cpu_yuv::rgba_to_nv12_fast; audio render moved after video encode-queue call; channel capacity bumped from 4 to 8; fill_nv12_frame_direct tightened to avoid double borrow of frame data.
crates/rendering/src/cpu_yuv.rs New rgba_to_nv12_fast function with scalar and rayon-parallel paths; BT.601 coefficients correct; validation guard returns false without writing on invalid input; baseline-matching tests added.
crates/rendering/src/frame_pipeline.rs GPU readback submission now precedes the wait on the previous frame, enabling better pipelining; error paths correctly cancel the newly-submitted readback before returning.
crates/rendering/src/lib.rs Clip-offsets now cached as a Vec before the render loop; upfront full zoom precompute removed in favour of per-frame lazy calls (correctly placed at lines 973/615). Platform guard for sequential decoder init removed; original rationale undocumented.
crates/rendering/src/project_recordings.rs Segment metadata loading parallelised with rayon::par_iter; RefCell is created inside each closure so Send is satisfied; Rayon preserves Vec order on collect.
crates/rendering/src/zoom_focus_interpolation.rs Tests only added; lazy_precompute_matches_full_precompute validates that per-frame ensure_precomputed_until produces identical results to upfront full precompute.

Comments Outside Diff (2)

  1. crates/rendering/src/lib.rs, line 312-320 (link)

    P2 Platform-specific sequential init removed without explanation

    The original code used #[cfg(target_os = "windows")] to run screen_future and camera_future concurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: crates/rendering/src/lib.rs
    Line: 312-320
    
    Comment:
    **Platform-specific sequential init removed without explanation**
    
    The original code used `#[cfg(target_os = "windows")]` to run `screen_future` and `camera_future` concurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.
    
    How can I resolve this? If you propose a fix, please make it concise.
  2. crates/audio/src/audio_data.rs, line 71-101 (link)

    P2 Per-packet resampler flush removed; final flush must drain everything

    The old code unconditionally flushed the resampler after each packet's receive_frame loop (and again after the conditional inner flush when run() signalled a delay). The new code removes all intermediate flushes and relies entirely on the final loop { resampler.flush() } after send_eof. For the from_file use case this is fine because the final flush loop runs until resample_delay is None, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics of flush returning None, audio data could silently be truncated.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: crates/audio/src/audio_data.rs
    Line: 71-101
    
    Comment:
    **Per-packet resampler flush removed; final flush must drain everything**
    
    The old code unconditionally flushed the resampler after each packet's `receive_frame` loop (and again after the conditional inner flush when `run()` signalled a delay). The new code removes all intermediate flushes and relies entirely on the final `loop { resampler.flush() }` after `send_eof`. For the `from_file` use case this is fine because the final flush loop runs until `resample_delay` is `None`, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics of `flush` returning `None`, audio data could silently be truncated.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
crates/rendering/src/lib.rs:312-320
**Platform-specific sequential init removed without explanation**

The original code used `#[cfg(target_os = "windows")]` to run `screen_future` and `camera_future` concurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.

### Issue 2 of 3
crates/editor/src/audio.rs:277-340
**Duplicated timeline/linear render logic across `_into` and `_to_slice` variants**

`render_timeline_frame_into` and `render_timeline_frame_to_slice` are nearly identical: both walk the same timeline-cursor state machine and call `render_current_chunk`. The only structural difference is that the `_into` variant resizes a `Vec<f32>` while the `_to_slice` variant takes a pre-sized `&mut [f32]`. The same duplication exists between `render_linear_frame_into` and `render_linear_frame_to_slice`. Any future behavioural fix (e.g. to the `chunk_samples` calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.

### Issue 3 of 3
crates/audio/src/audio_data.rs:71-101
**Per-packet resampler flush removed; final flush must drain everything**

The old code unconditionally flushed the resampler after each packet's `receive_frame` loop (and again after the conditional inner flush when `run()` signalled a delay). The new code removes all intermediate flushes and relies entirely on the final `loop { resampler.flush() }` after `send_eof`. For the `from_file` use case this is fine because the final flush loop runs until `resample_delay` is `None`, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics of `flush` returning `None`, audio data could silently be truncated.

Reviews (1): Last reviewed commit: "chore(export): add encoder benchmark exa..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

@superagent-security superagent-security Bot added contributor:verified Contributor passed trust analysis. pr:verified PR passed security analysis. labels May 8, 2026

fn f32_slice_from_audio_frame(frame: &mut FFAudio, len: usize) -> &mut [f32] {
let bytes = &mut frame.data_mut(0)[..len * f32::BYTE_SIZE];
unsafe { std::slice::from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) }

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) assumes the FFmpeg buffer is always aligned for f32. If that ever regresses, this becomes UB.

Suggested change
unsafe { std::slice::from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) }
fn f32_slice_from_audio_frame(frame: &mut FFAudio, len: usize) -> &mut [f32] {
let bytes = &mut frame.data_mut(0)[..len * f32::BYTE_SIZE];
let ptr = bytes.as_mut_ptr();
assert_eq!(ptr.align_offset(std::mem::align_of::<f32>()), 0);
unsafe { std::slice::from_raw_parts_mut(ptr.cast::<f32>(), len) }
}

let Some(max_index) = project.clips.iter().map(|clip| clip.index as usize).max() else {
self.clip_offsets_by_index.clear();
return;
};

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resize(max_index + 1, …) can become an accidental huge allocation if clip.index is sparse / untrusted (e.g. a project file with a very large index). Since clip_offsets_for_project already has a linear fallback, it seems worth guarding the cache.

Suggested change
};
let Some(max_index) = project.clips.iter().map(|clip| clip.index as usize).max() else {
self.clip_offsets_by_index.clear();
return;
};
if max_index > 100_000 {
self.clip_offsets_by_index.clear();
return;
}

fn is_valid_for(self, rgba: &[u8], output: &[u8]) -> bool {
if self.width == 0 || self.height == 0 {
return false;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uv_height() / width / 2 assumes even dimensions. If a caller ever passes odd width/height, this will silently drop chroma samples (and the buffer size math gets ambiguous). It might be safer to reject odd sizes up front.

Suggested change
}
if self.width == 0 || self.height == 0 || self.width % 2 != 0 || self.height % 2 != 0 {
return false;
}

Comment on lines 277 to 340
@@ -219,8 +340,8 @@ impl AudioRenderer {
return None;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Duplicated timeline/linear render logic across _into and _to_slice variants

render_timeline_frame_into and render_timeline_frame_to_slice are nearly identical: both walk the same timeline-cursor state machine and call render_current_chunk. The only structural difference is that the _into variant resizes a Vec<f32> while the _to_slice variant takes a pre-sized &mut [f32]. The same duplication exists between render_linear_frame_into and render_linear_frame_to_slice. Any future behavioural fix (e.g. to the chunk_samples calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.

Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/editor/src/audio.rs
Line: 277-340

Comment:
**Duplicated timeline/linear render logic across `_into` and `_to_slice` variants**

`render_timeline_frame_into` and `render_timeline_frame_to_slice` are nearly identical: both walk the same timeline-cursor state machine and call `render_current_chunk`. The only structural difference is that the `_into` variant resizes a `Vec<f32>` while the `_to_slice` variant takes a pre-sized `&mut [f32]`. The same duplication exists between `render_linear_frame_into` and `render_linear_frame_to_slice`. Any future behavioural fix (e.g. to the `chunk_samples` calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor:verified Contributor passed trust analysis. pr:verified PR passed security analysis.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant