Faster export pipeline with parallel IO, cached offsets, and CPU RGBA→NV12 by richiemcilroy · Pull Request #1786 · CapSoftware/Cap

richiemcilroy · 2026-05-08T09:35:32Z

Improves desktop export and rendering throughput by overlapping independent setup work, reducing redundant CPU work on hot paths, and tightening a few correctness edges around FFmpeg audio/resampling and GPU readback sequencing.

Greptile Summary

This PR improves the desktop export/rendering pipeline through several independent optimisations: parallelising audio/decoder setup I/O, replacing per-frame O(n) clips.iter().find() lookups with cached index maps, swapping the FFmpeg swscale RGBA→NV12 path for a hand-written CPU implementation (with optional rayon parallelism above 1080p), and reordering GPU readback submission to overlap with waiting for the previous frame.

Parallel IO: create_segments and RenderVideoConstants::new are now joined with tokio::try_join!, and multiple-segment audio loading is parallelised with try_join_all and rayon::par_iter.
Cached clip offsets: Both render paths (render_video_to_channel / _nv12) now build a Vec<ClipOffsets> once before the frame loop, eliminating repeated clips.iter().find() calls per frame.
CPU RGBA→NV12: A new cpu_yuv::rgba_to_nv12_fast replaces the swscale context creation per frame, using rayon row-level parallelism above 1080p; the conversion function is also centralised to remove two separate inline copies.
Audio renderer: gain is now pre-converted to linear scale, the inner per-frame mixing loop is restructured for better cache locality, and timeline_cursor caches its walk state to avoid re-scanning segments each call.

Confidence Score: 4/5

Safe to merge for most platforms; the removal of the Windows-only sequential decoder init guard deserves a macOS smoke-test before shipping.

The core optimisations are well-tested with baseline-matching unit tests. Two areas warrant attention: concurrent decoder init is now unconditional on all platforms (original rationale for the Windows-only guard is undocumented), and the duplicated _into/_to_slice render methods require parallel maintenance.

crates/rendering/src/lib.rs (removed platform guard for decoder init), crates/editor/src/audio.rs (duplicated _into/_to_slice render methods)

Important Files Changed

Filename	Overview
crates/audio/src/audio_data.rs	Simplified resampler flush loop: all intermediate per-packet flushes removed, replaced by a single final drain. Functionally equivalent for `from_file` but relies on the final flush contract.
crates/audio/src/renderer.rs	Refactored mixing loop for cache-locality: per-track bulk iteration replaces per-sample cross-track loop; `gain` field renamed to `linear_gain` (pre-converted); `valid_output_range` helper correctly handles negative track offsets; well-tested by baseline-matching tests.
crates/editor/src/audio.rs	Added `_to_slice` variants of render methods to avoid allocations; introduced cached clip-offsets with `ClipOffsets::default()` fallback; `timeline_cursor` now caches segment walk state. Duplicated logic between `_into` and `_to_slice` paths creates maintenance risk.
crates/enc-ffmpeg/src/audio/buffered_resampler.rs	Resampler is now skipped when input/output formats already match; passthrough path correctly updates `min_next_pts` and the top-level PTS adjustment guard still applies.
crates/export/src/mp4.rs	RGBA→NV12 via swscale replaced by `cpu_yuv::rgba_to_nv12_fast`; audio render moved after video encode-queue call; channel capacity bumped from 4 to 8; `fill_nv12_frame_direct` tightened to avoid double borrow of frame data.
crates/rendering/src/cpu_yuv.rs	New `rgba_to_nv12_fast` function with scalar and rayon-parallel paths; BT.601 coefficients correct; validation guard returns false without writing on invalid input; baseline-matching tests added.
crates/rendering/src/frame_pipeline.rs	GPU readback submission now precedes the wait on the previous frame, enabling better pipelining; error paths correctly cancel the newly-submitted readback before returning.
crates/rendering/src/lib.rs	Clip-offsets now cached as a `Vec` before the render loop; upfront full zoom precompute removed in favour of per-frame lazy calls (correctly placed at lines 973/615). Platform guard for sequential decoder init removed; original rationale undocumented.
crates/rendering/src/project_recordings.rs	Segment metadata loading parallelised with `rayon::par_iter`; `RefCell` is created inside each closure so `Send` is satisfied; Rayon preserves Vec order on collect.
crates/rendering/src/zoom_focus_interpolation.rs	Tests only added; `lazy_precompute_matches_full_precompute` validates that per-frame `ensure_precomputed_until` produces identical results to upfront full precompute.

Comments Outside Diff (2)

crates/rendering/src/lib.rs, line 312-320 (link)

Platform-specific sequential init removed without explanation

The original code used #[cfg(target_os = "windows")] to run screen_future and camera_future concurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.

Prompt To Fix With AI

This is a comment left during a code review.
Path: crates/rendering/src/lib.rs
Line: 312-320

Comment:
**Platform-specific sequential init removed without explanation**

The original code used `#[cfg(target_os = "windows")]` to run `screen_future` and `camera_future` concurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.

How can I resolve this? If you propose a fix, please make it concise.

crates/audio/src/audio_data.rs, line 71-101 (link)

Per-packet resampler flush removed; final flush must drain everything

The old code unconditionally flushed the resampler after each packet's receive_frame loop (and again after the conditional inner flush when run() signalled a delay). The new code removes all intermediate flushes and relies entirely on the final loop { resampler.flush() } after send_eof. For the from_file use case this is fine because the final flush loop runs until resample_delay is None, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics of flush returning None, audio data could silently be truncated.

Prompt To Fix With AI

This is a comment left during a code review.
Path: crates/audio/src/audio_data.rs
Line: 71-101

Comment:
**Per-packet resampler flush removed; final flush must drain everything**

The old code unconditionally flushed the resampler after each packet's `receive_frame` loop (and again after the conditional inner flush when `run()` signalled a delay). The new code removes all intermediate flushes and relies entirely on the final `loop { resampler.flush() }` after `send_eof`. For the `from_file` use case this is fine because the final flush loop runs until `resample_delay` is `None`, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics of `flush` returning `None`, audio data could silently be truncated.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
crates/rendering/src/lib.rs:312-320
**Platform-specific sequential init removed without explanation**

The original code used `#[cfg(target_os = "windows")]` to run `screen_future` and `camera_future` concurrently only on Windows, keeping them sequential on macOS/Linux. This PR removes that guard and makes concurrent init unconditional. If the sequential path existed to work around a macOS-specific constraint (e.g. an AVFoundation or Core Video limitation that requires serial decoder construction on the main thread), running both futures concurrently on non-Windows builds could cause intermittent init failures or deadlocks that are hard to reproduce.

### Issue 2 of 3
crates/editor/src/audio.rs:277-340
**Duplicated timeline/linear render logic across `_into` and `_to_slice` variants**

`render_timeline_frame_into` and `render_timeline_frame_to_slice` are nearly identical: both walk the same timeline-cursor state machine and call `render_current_chunk`. The only structural difference is that the `_into` variant resizes a `Vec<f32>` while the `_to_slice` variant takes a pre-sized `&mut [f32]`. The same duplication exists between `render_linear_frame_into` and `render_linear_frame_to_slice`. Any future behavioural fix (e.g. to the `chunk_samples` calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.

### Issue 3 of 3
crates/audio/src/audio_data.rs:71-101
**Per-packet resampler flush removed; final flush must drain everything**

The old code unconditionally flushed the resampler after each packet's `receive_frame` loop (and again after the conditional inner flush when `run()` signalled a delay). The new code removes all intermediate flushes and relies entirely on the final `loop { resampler.flush() }` after `send_eof`. For the `from_file` use case this is fine because the final flush loop runs until `resample_delay` is `None`, which by libswresample contract means the delay buffer is fully drained. However, the correctness of the simplified path depends entirely on that contract; if the upstream FFmpeg Rust bindings ever change the semantics of `flush` returning `None`, audio data could silently be truncated.

_{Reviews (1): Last reviewed commit: "chore(export): add encoder benchmark exa..." | Re-trigger Greptile}

Greptile also left 1 inline comment on this PR.

…uration

…ersion

tembo · 2026-05-08T09:42:23Z


+fn f32_slice_from_audio_frame(frame: &mut FFAudio, len: usize) -> &mut [f32] {
+    let bytes = &mut frame.data_mut(0)[..len * f32::BYTE_SIZE];
+    unsafe { std::slice::from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) }


from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) assumes the FFmpeg buffer is always aligned for f32. If that ever regresses, this becomes UB.

Suggested change

unsafe { std::slice::from_raw_parts_mut(bytes.as_mut_ptr() as *mut f32, len) }

fn f32_slice_from_audio_frame(frame: &mut FFAudio, len: usize) -> &mut [f32] {

let bytes = &mut frame.data_mut(0)[..len * f32::BYTE_SIZE];

let ptr = bytes.as_mut_ptr();

assert_eq!(ptr.align_offset(std::mem::align_of::<f32>()), 0);

unsafe { std::slice::from_raw_parts_mut(ptr.cast::<f32>(), len) }

}

tembo · 2026-05-08T09:42:35Z

+        let Some(max_index) = project.clips.iter().map(|clip| clip.index as usize).max() else {
+            self.clip_offsets_by_index.clear();
+            return;
+        };


resize(max_index + 1, …) can become an accidental huge allocation if clip.index is sparse / untrusted (e.g. a project file with a very large index). Since clip_offsets_for_project already has a linear fallback, it seems worth guarding the cache.

Suggested change

};

let Some(max_index) = project.clips.iter().map(|clip| clip.index as usize).max() else {

self.clip_offsets_by_index.clear();

return;

};

if max_index > 100_000 {

self.clip_offsets_by_index.clear();

return;

}

tembo · 2026-05-08T09:42:44Z

+    fn is_valid_for(self, rgba: &[u8], output: &[u8]) -> bool {
+        if self.width == 0 || self.height == 0 {
+            return false;
+        }


uv_height() / width / 2 assumes even dimensions. If a caller ever passes odd width/height, this will silently drop chroma samples (and the buffer size math gets ambiguous). It might be safer to reject odd sizes up front.

Suggested change

}

if self.width == 0 || self.height == 0 || self.width % 2 != 0 || self.height % 2 != 0 {

return false;

}

greptile-apps · 2026-05-08T09:47:54Z

@@ -219,8 +340,8 @@ impl AudioRenderer {
            return None;


Duplicated timeline/linear render logic across _into and _to_slice variants

render_timeline_frame_into and render_timeline_frame_to_slice are nearly identical: both walk the same timeline-cursor state machine and call render_current_chunk. The only structural difference is that the _into variant resizes a Vec<f32> while the _to_slice variant takes a pre-sized &mut [f32]. The same duplication exists between render_linear_frame_into and render_linear_frame_to_slice. Any future behavioural fix (e.g. to the chunk_samples calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence.

Prompt To Fix With AI

This is a comment left during a code review. Path: crates/editor/src/audio.rs Line: 277-340 Comment: **Duplicated timeline/linear render logic across `_into` and `_to_slice` variants** `render_timeline_frame_into` and `render_timeline_frame_to_slice` are nearly identical: both walk the same timeline-cursor state machine and call `render_current_chunk`. The only structural difference is that the `_into` variant resizes a `Vec<f32>` while the `_to_slice` variant takes a pre-sized `&mut [f32]`. The same duplication exists between `render_linear_frame_into` and `render_linear_frame_to_slice`. Any future behavioural fix (e.g. to the `chunk_samples` calculation or the cursor advance) must be applied in both variants independently, increasing the chance of divergence. How can I resolve this? If you propose a fix, please make it concise.

richiemcilroy added 17 commits May 8, 2026 10:33

perf(enc-ffmpeg): skip BufferedResampler when conversion is identity

6769618

perf(rendering): add rgba_to_nv12_fast CPU conversion

0efa98f

perf(rendering): parallelize segment meta load and add first_camera_d…

7496ff1

…uration

test(rendering): verify lazy zoom precompute matches full precompute

ea44510

perf(rendering): cache clip offsets and accelerate NV12 readback conv…

cc7684c

…ersion

fix(rendering): order readback wait after submit in finish_encoder paths

5f5d2a3

fix(audio): handle resampler flush loops and propagate decode EOF errors

963db80

perf(audio): batch stereo mix using prepared linear gains

fec7de5

perf(editor): render audio in-place with cached clip offsets

3dab4a4

perf(editor): load segment audio concurrently with decoder setup

f8a3094

refactor(editor): use first_camera_duration for recording length bound

fc33541

chore(editor): re-export AudioSegment from crate root

e9b712a

build(export): add cap-audio dev dependency for benchmarks

2e780d0

perf(export): parallelize renderer setup with segment loading

5db7637

perf(export): speed NV12 pipeline and widen export frame channel

76e2a15

chore(export): add CPU startup profiling example binary

a424daa

chore(export): add encoder benchmark example binary

0598c11

superagent-security Bot added contributor:verified Contributor passed trust analysis. pr:verified PR passed security analysis. labels May 8, 2026

tembo Bot reviewed May 8, 2026

View reviewed changes

greptile-apps Bot reviewed May 8, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster export pipeline with parallel IO, cached offsets, and CPU RGBA→NV12#1786

Faster export pipeline with parallel IO, cached offsets, and CPU RGBA→NV12#1786
richiemcilroy wants to merge 17 commits into
mainfrom
rendering-bits

richiemcilroy commented May 8, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

tembo Bot May 8, 2026

Uh oh!

tembo Bot May 8, 2026

Uh oh!

tembo Bot May 8, 2026

Uh oh!

greptile-apps Bot May 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

richiemcilroy commented May 8, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Comments Outside Diff (2)

Uh oh!

tembo Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

tembo Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

tembo Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 8, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

richiemcilroy commented May 8, 2026 •

edited by greptile-apps Bot

Loading