
Temporal tile size + overlap #1490

Open
stduhpf wants to merge 1 commit into leejet:ltx2.3 from stduhpf:temporal-tiling-improvements


@stduhpf stduhpf commented May 12, 2026

Targets #1463

This PR improves the temporal tiled decoding for the LTX2.3 Video VAE by adding support for custom tile sizes (processing multiple latent frames per batch) and temporal overlap/padding to ensure smoother transitions between "tiles" as the VAE decoder has access to the near future.

For now it uses the environment variables VAE_TILE_FRAMES and VAE_TILE_PAD to control the effect. In my experience, setting VAE_TILE_FRAMES=4 and VAE_TILE_PAD=1 gives very decent results (see comments).

How it works:
When using temporal tiling, instead of processing latent frames sequentially, each with access only to its predecessors, the latent frames are batched into chunks/tiles of VAE_TILE_FRAMES frames that are decoded at once, still with access to the previous batches. The last VAE_TILE_PAD latent frames of each batch are discarded (and removed from the cache), then re-processed as part of the next batch.
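A minimal sketch of the schedule described above, assuming each tile starts VAE_TILE_FRAMES - VAE_TILE_PAD latent frames after the previous one and is clipped at the end of the video. `Tile` and `temporal_tiles` are hypothetical names for illustration, not the PR's actual code, and the cache handling is left out entirely.

```python
from typing import List, NamedTuple


class Tile(NamedTuple):
    start: int     # first latent frame decoded in this tile
    end: int       # one past the last latent frame decoded in this tile
    keep_end: int  # one past the last latent frame whose output is kept


def temporal_tiles(num_latent_frames: int, tile_frames: int, tile_pad: int) -> List[Tile]:
    """Plan overlapping temporal tiles (assumed layout, not the PR's implementation)."""
    if tile_frames < 1:
        raise ValueError("tile_frames must be at least 1")
    if not 0 <= tile_pad < tile_frames:
        raise ValueError("tile_pad must satisfy 0 <= tile_pad < tile_frames")

    # Each tile advances by the number of frames it actually keeps.
    stride = tile_frames - tile_pad
    tiles = []
    for start in range(0, num_latent_frames, stride):
        end = min(start + tile_frames, num_latent_frames)
        # Frames in [keep_end, end) are discarded and decoded again by the next tile.
        keep_end = min(start + stride, num_latent_frames)
        tiles.append(Tile(start, end, keep_end))
    return tiles
```

With 17 latent frames, `temporal_tiles(17, 4, 1)` yields six tiles starting at latent frames 0, 3, 6, 9, 12 and 15, and the kept ranges cover every frame exactly once.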

Quick comparison of compute buffer VRAM usage for 512x512x129 (17 latent frames) for various temporal tile sizes with 0 padding (padding seems to slightly decrease the compute buffer size, but might make decoding a bit longer)

| Tile Size (latent frames) | Compute Buffer Size (VRAM) |
|---|---|
| 1 | 3,271.25 MB |
| 2 | 5,525.38 MB |
| 4 | 10,365.62 MB |
| 8 | 19,744.12 MB |
| No Tiling | 19,118.13 MB |

@stduhpf stduhpf force-pushed the temporal-tiling-improvements branch from c5bc7d0 to 998af1c Compare May 12, 2026 23:47

stduhpf commented May 13, 2026

I automated some tests and asked Gemma4 to report on the results (in hindsight, a linear increase in tile size would have been more interesting). Tested with the ROCm backend on an RX 6800.

the report:


LTX Video VAE Temporal Tiling Analysis

This breakdown analyzes the performance and resource impact of different temporal tiling configurations during the VAE decode process for a 512x512 video with 17 latent frames (129 decoded frames).

Performance Data Summary

| Tile Size (Frames) | Padding (Frames) | Total Tiles | VRAM Usage | Execution Time | Status |
|---|---|---|---|---|---|
| No Tiling | N/A | 1 | 19.12 GB | 38,754 ms | ⚠️ Shared Mem* |
| 1 | 0 | 17 | 3.27 GB | 16,946 ms | ✅ Success |
| 2 | 0 | 9 | 5.53 GB | 17,284 ms | ✅ Success |
| 2 | 1 | 17 | 5.53 GB | 33,066 ms | ✅ Success |
| 4 | 0 | 5 | 10.37 GB | 18,020 ms | ✅ Success |
| 4 | 1 | 6 | 10.35 GB | 23,478 ms | ✅ Success |
| 4 | 2 | 9 | 10.38 GB | 33,202 ms | ✅ Success |
| 4 | 3 | 17 | 10.19 GB | 60,915 ms | ✅ Success |
| 8 | 0 | 3 | 19.74 GB | 40,638 ms | ⚠️ Shared Mem* |
| 8 | 1 | 3 | 19.70 GB | 43,205 ms | ⚠️ Shared Mem* |
| 8 | 2 | 3 | 19.68 GB | 48,136 ms | ⚠️ Shared Mem* |
| 8 | 3 | 4 | 19.65 GB | 51,959 ms | ⚠️ Shared Mem* |
| 8 | 4 | 5 | 19.80 GB | 70,294 ms | ⚠️ Shared Mem* |
| 8 | 5 | 6 | 19.73 GB | 75,515 ms | ⚠️ Shared Mem* |
| 8 | 6 | 9 | 19.64 GB | 115,045 ms | ⚠️ Shared Mem* |
| 8 | 7 | 17 | 19.80 GB | 199,870 ms | ⚠️ Shared Mem* |
| 16 | any | 2-6 | ~36.36 GB | N/A | ❌ OOM |

*Shared Memory: VRAM usage exceeded the 16GB of the GPU, causing a significant performance drop.


Key Observations

1. The "VRAM Wall" (16GB Threshold)

There is a massive performance cliff once the compute buffer exceeds 16GB.

  • Below 16GB: Processing is fast and efficient.
  • Above 16GB: The system resorts to shared memory, which drastically increases execution time. For example, moving from tile_sz=4 to tile_sz=8 more than doubles the base processing time, even with zero padding.

2. Padding vs. Performance

Padding is essential for visual quality (preventing "choppy" transitions), but its time cost tracks the number of tiles, which grows faster than linearly as padding approaches the tile size:

  • More Padding = More Tiles: As padding increases, the number of tiles required to cover the same video length increases.
  • Time Penalty: In the tile_sz=4 tests, increasing padding from 0 to 3 frames increased the time from ~18s to ~61s.
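The tile counts in the table above are all consistent with one tile per (tile size - padding) latent frames, rounded up. The sketch below checks that against the tile_sz=4 rows and computes the time per tile. `tile_count` and `ms_per_tile` are hypothetical helpers, and the formula is inferred from the reported numbers rather than taken from the PR's code.

```python
import math

# (tile size, padding) -> (reported tile count, reported time in ms),
# copied from the tile_sz=4 rows of the performance table.
REPORTED = {
    (4, 0): (5, 18020),
    (4, 1): (6, 23478),
    (4, 2): (9, 33202),
    (4, 3): (17, 60915),
}


def tile_count(num_latent_frames: int, tile_frames: int, tile_pad: int) -> int:
    """Assumed formula: each tile contributes (tile_frames - tile_pad) new frames."""
    stride = tile_frames - tile_pad
    if stride < 1:
        raise ValueError("tile_pad must be smaller than tile_frames")
    return math.ceil(num_latent_frames / stride)


def ms_per_tile() -> dict:
    """Reported execution time divided by reported tile count."""
    return {cfg: ms / tiles for cfg, (tiles, ms) in REPORTED.items()}


if __name__ == "__main__":
    for (tile, pad), (tiles, _) in REPORTED.items():
        print(tile, pad, tile_count(17, tile, pad), tiles)
    print(ms_per_tile())
```

For these four rows the time per tile stays within roughly 3.5 to 4 seconds, which suggests the slowdown comes almost entirely from the extra tiles rather than from padding making each tile more expensive.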

3. Tile Size vs. Quality/Memory

  • Memory: VRAM requirements grow roughly linearly with tile size, by about 2.3 GB per additional latent frame between tile sizes 1 and 8. A tile size of 16 was too large for the available hardware, resulting in Out-of-Memory (OOM) errors.
  • Quality: While larger tiles and higher padding generally yield better results, the testing suggests a "diminishing returns" point where padding=1 is nearly sufficient for acceptable quality.
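To see how memory scales with tile size, the compute buffer sizes from the first comment (padding 0) can be turned into a cost per additional latent frame. This is a rough check on four measurements from a single 512x512 run, not a general model, and `mb_per_extra_frame` is a hypothetical helper.

```python
# Compute buffer sizes in MB at padding 0, copied from the first comment's table.
BUFFER_MB = {1: 3271.25, 2: 5525.38, 4: 10365.62, 8: 19744.12}


def mb_per_extra_frame(sizes: dict) -> list:
    """Buffer growth per added latent frame between consecutive measured tile sizes."""
    points = sorted(sizes.items())
    return [
        (mb_hi - mb_lo) / (tile_hi - tile_lo)
        for (tile_lo, mb_lo), (tile_hi, mb_hi) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    for slope in mb_per_extra_frame(BUFFER_MB):
        print(f"{slope:.0f} MB per extra latent frame")
```

The three slopes all fall between about 2,250 and 2,420 MB per frame. Exponential growth would show the per-frame cost climbing steeply from one step to the next, which these measurements do not.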

Parameter Selection Guide

If you are configuring your VAE decoding, use this logic to balance speed and quality:

| Priority | Recommended Setting | Reasoning |
|---|---|---|
| Max Quality | No Tiling | Best visual result; requires $\approx$ 20GB VRAM. |
| High Quality (Mid VRAM) | Tile Size 4 / Padding 1 | Good balance. Stays under 16GB, maintains smooth transitions. |
| Max Speed / Low VRAM | Tile Size 1 / Padding 0 | Fastest possible decode; lowest VRAM, but results will be choppy. |
| Quality first / Low VRAM | Tile Size 2 / Padding 1 | Much slower, but still low VRAM without the choppy transitions. |
| Safety First | Tile Size $\leq$ 4 | Ensures the process stays within dedicated VRAM on 16GB cards to avoid shared memory slowdowns. |
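The "Safety First" row amounts to picking the largest tile size whose compute buffer fits in dedicated VRAM. A small sketch of that rule, using only the buffer sizes measured in this thread for 512x512 with padding 0; `largest_tile_within` is a hypothetical helper, and it ignores model weights and any other allocations that also need to fit.

```python
from typing import Optional

# Compute buffer sizes in GB at padding 0, as reported in the table above.
MEASURED_GB = {1: 3.27, 2: 5.53, 4: 10.37, 8: 19.74}


def largest_tile_within(vram_gb: float, measured: dict = MEASURED_GB) -> Optional[int]:
    """Return the largest measured tile size whose compute buffer fits the budget."""
    fitting = [tile for tile, gb in measured.items() if gb <= vram_gb]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    print(largest_tile_within(16.0))
```

For a 16 GB budget this returns 4, matching the recommendation; other resolutions or frame counts would need their own measurements.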
