Temporal tile size + overlap by stduhpf · Pull Request #1490 · leejet/stable-diffusion.cpp

stduhpf · 2026-05-12T22:27:34Z

Targets #1463

This PR improves temporal tiled decoding for the LTX2.3 Video VAE by adding support for custom tile sizes (processing multiple latent frames per batch) and temporal overlap/padding. The overlap gives smoother transitions between "tiles", since the VAE decoder has access to the near future.

For now it uses the env variables VAE_TILE_FRAMES and VAE_TILE_PAD to control the effect. In my experience, setting VAE_TILE_FRAMES=4 and VAE_TILE_PAD=1 seems to give very decent results (see comments).
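As a rough illustration of how the two variables could be read and sanity-checked, here is a minimal sketch. It is not the code from this PR: the helper names, the `TemporalTileConfig` struct, the fallback values (1 frame per tile, 0 padding) and the clamping of the padding to less than the tile size are all assumptions made for the example.

```cpp
#include <cstdlib>

// Hypothetical helper (not from the PR): parse a non-negative integer from an
// environment variable, returning `fallback` when it is unset or malformed.
static int env_int_or(const char* name, int fallback) {
    const char* raw = std::getenv(name);
    if (raw == nullptr || *raw == '\0') return fallback;
    char* end = nullptr;
    long v = std::strtol(raw, &end, 10);
    if (*end != '\0' || v < 0 || v > (1L << 20)) return fallback;
    return static_cast<int>(v);
}

// Hypothetical container for the two settings described above.
struct TemporalTileConfig {
    int frames;  // latent frames decoded per tile (VAE_TILE_FRAMES)
    int pad;     // trailing latent frames redone by the next tile (VAE_TILE_PAD)
};

// Assumed defaults: one latent frame per tile and no padding.
static TemporalTileConfig read_temporal_tile_config() {
    TemporalTileConfig cfg;
    cfg.frames = env_int_or("VAE_TILE_FRAMES", 1);
    if (cfg.frames < 1) cfg.frames = 1;
    cfg.pad = env_int_or("VAE_TILE_PAD", 0);
    // Each tile must keep at least one frame, otherwise decoding never advances.
    if (cfg.pad > cfg.frames - 1) cfg.pad = cfg.frames - 1;
    return cfg;
}
```

With `VAE_TILE_FRAMES=4 VAE_TILE_PAD=1` in the environment, this sketch yields `frames == 4` and `pad == 1`; a padding value of 4 or more would be clamped to 3.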

How it works:
When using temporal tiling, instead of processing latent frames sequentially, with each frame able to access only its predecessors, the latent frames are batched together in chunks/tiles of size VAE_TILE_FRAMES and decoded at once, still with access to the previous batches. The last VAE_TILE_PAD latent frames of each batch are discarded (and removed from the cache), and then re-processed as part of the next batch.
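The batching described above can be sketched as a small planner over latent frame indices. This is only an illustration, not the PR's implementation: `TemporalTile` and `plan_temporal_tiles` are hypothetical names, and the handling of the final, shorter tiles is an assumption chosen so that the number of tiles matches the counts reported in the performance table in the comments.

```cpp
#include <algorithm>
#include <vector>

// One temporal tile over latent frame indices, using half-open ranges.
struct TemporalTile {
    int begin;     // first latent frame decoded in this tile
    int end;       // one past the last latent frame decoded in this tile
    int keep_end;  // [begin, keep_end) is kept; [keep_end, end) is discarded
                   // and decoded again as part of the next tile
};

// Hypothetical planner (assumption, not the PR's code).
static std::vector<TemporalTile> plan_temporal_tiles(int n_latent,
                                                     int tile_frames,
                                                     int tile_pad) {
    std::vector<TemporalTile> tiles;
    if (n_latent <= 0 || tile_frames < 1) return tiles;
    // Padding must stay below the tile size so every tile keeps a frame.
    tile_pad = std::max(0, std::min(tile_pad, tile_frames - 1));
    const int stride = tile_frames - tile_pad;  // frames kept per tile
    for (int begin = 0; begin < n_latent; begin += stride) {
        TemporalTile t;
        t.begin    = begin;
        t.end      = std::min(begin + tile_frames, n_latent);
        t.keep_end = std::min(begin + stride, n_latent);
        tiles.push_back(t);
    }
    return tiles;
}
```

For 17 latent frames with a tile size of 4 and a padding of 1, this produces 6 tiles; the second one decodes latent frames 3 to 6 and keeps 3 to 5, so frame 6 is decoded again at the start of the third tile.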

Quick comparison of compute buffer VRAM usage for 512x512x129 (17 latent frames) for various temporal tile sizes with 0 padding. Padding seems to slightly decrease the compute buffer size, but might make decoding take a bit longer.

Tile Size (frames)	Compute Buffer Size (VRAM)
1	3,271.25 MB
2	5,525.38 MB
4	10,365.62 MB
8	19,744.12 MB
No Tiling	19,118.13 MB

stduhpf · 2026-05-13T01:36:37Z

I automated some tests and asked Gemma4 to report on the results (in hindsight, a linear increase in tile size would have been more interesting). The tests used the ROCm backend on an RX 6800.

the report:

LTX Video VAE Temporal Tiling Analysis

This breakdown analyzes the performance and resource impact of different temporal tiling configurations during the VAE decode process for a 512x512 video with 17 latent frames (129 video frames).

Performance Data Summary

Tile Size (Frames)	Padding (Frames)	Total Tiles	VRAM Usage	Execution Time	Status
No Tiling	N/A	1	19.12 GB	38,754 ms	⚠️ Shared Mem*
1	0	17	3.27 GB	16,946 ms	✅ Success
2	0	9	5.53 GB	17,284 ms	✅ Success
2	1	17	5.53 GB	33,066 ms	✅ Success
4	0	5	10.37 GB	18,020 ms	✅ Success
4	1	6	10.35 GB	23,478 ms	✅ Success
4	2	9	10.38 GB	33,202 ms	✅ Success
4	3	17	10.19 GB	60,915 ms	✅ Success
8	0	3	19.74 GB	40,638 ms	⚠️ Shared Mem*
8	1	3	19.70 GB	43,205 ms	⚠️ Shared Mem*
8	2	3	19.68 GB	48,136 ms	⚠️ Shared Mem*
8	3	4	19.65 GB	51,959 ms	⚠️ Shared Mem*
8	4	5	19.80 GB	70,294 ms	⚠️ Shared Mem*
8	5	6	19.73 GB	75,515 ms	⚠️ Shared Mem*
8	6	9	19.64 GB	115,045 ms	⚠️ Shared Mem*
8	7	17	19.80 GB	199,870 ms	⚠️ Shared Mem*
16	any	2-6	~36.36 GB	N/A	❌ OOM

*Shared Memory: VRAM usage exceeded the 16GB of the GPU, causing a significant performance drop.

Key Observations

1. The "VRAM Wall" (16GB Threshold)

There is a massive performance cliff once the compute buffer exceeds 16GB.

Below 16GB: Processing is fast and efficient.
Above 16GB: The system resorts to shared memory, which drastically increases execution time. For example, moving from tile_sz=4 to tile_sz=8 more than doubles the base processing time, even with zero padding.

2. Padding vs. Performance

Padding is essential for visual quality (preventing "choppy" transitions), but its time cost grows faster than linearly, because each tile only keeps tile size minus padding frames:

More Padding = More Tiles: As padding increases, the number of tiles required to cover the same video length increases.
Time Penalty: In the tile_sz=4 tests, increasing padding from 0 to 3 frames increased the time from ~18s to ~61s.
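The tile counts in the performance table are consistent with the relation below. It is inferred from the reported numbers rather than taken from the PR's code; $N$ is the number of latent frames, $T$ the tile size and $P$ the padding.

```latex
\text{tiles} = \left\lceil \frac{N}{T - P} \right\rceil, \qquad 0 \le P < T
```

For $N = 17$ and $T = 4$ this gives 5, 6, 9 and 17 tiles for $P = 0, 1, 2, 3$, matching the table. The tile count grows like $1/(T - P)$, which is why the last step of padding is the most expensive one.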

3. Tile Size vs. Quality/Memory

Memory: Larger tile sizes increase VRAM requirements roughly linearly (about 3.3 GB at tile size 1, 10.4 GB at 4 and 19.7 GB at 8). A tile size of 16 was too large for the available hardware, resulting in Out-of-Memory (OOM) errors.
Quality: While larger tiles and higher padding generally yield better results, the testing suggests a "diminishing returns" point where padding=1 is nearly sufficient for acceptable quality.

Parameter Selection Guide

If you are configuring your VAE decoding, use this logic to balance speed and quality:

Priority	Recommended Setting	Reasoning
Max Quality	No Tiling	Best visual result; requires $\approx$ 20GB VRAM.
High Quality (Mid VRAM)	Tile Size 4 / Padding 1	Good balance. Stays under 16GB, maintains smooth transitions.
Max Speed / Low VRAM	Tile Size 1 / Padding 0	Fastest possible decode; lowest VRAM, but results will be choppy.
Quality First / Low VRAM	Tile Size 2 / Padding 1	Much slower, but still low VRAM without the choppy transitions.
Safety First	Tile Size $\leq$ 4	Ensures the process stays within dedicated VRAM on 16GB cards to avoid shared memory slowdowns.

Temporal tile size + overlap

998af1c

stduhpf force-pushed the temporal-tiling-improvements branch from c5bc7d0 to 998af1c · May 12, 2026 23:47

stduhpf wants to merge 1 commit into leejet:ltx2.3 from stduhpf:temporal-tiling-improvements
