Temporal tile size + overlap #1490
I automated some tests and asked Gemma4 to report on the results (in hindsight, a linear increase in tile size would have been more interesting). Using the ROCm backend on an RX 6800, the report:

LTX Video VAE Temporal Tiling Analysis

This breakdown analyzes the performance and resource impact of different temporal tiling configurations during the VAE decode process for a 512x512 video with 17 total frames (these are 17 latent frames, so 129 video frames).

Performance Data Summary
*Shared Memory: VRAM usage exceeded the GPU's 16GB, causing a significant performance drop.

Key Observations

1. The "VRAM Wall" (16GB Threshold)
There is a massive performance cliff once the compute buffer exceeds 16GB.
2. Padding vs. Performance
Padding is essential for visual quality (preventing "choppy" transitions), but it comes at a linear cost to time:
3. Tile Size vs. Quality/Memory
Parameter Selection Guide
If you are configuring your VAE decoding, use this logic to balance speed and quality:
|
Targets #1463
This PR improves the temporal tiled decoding for the LTX2.3 Video VAE by adding support for custom tile sizes (processing multiple latent frames per batch) and temporal overlap/padding to ensure smoother transitions between "tiles" as the VAE decoder has access to the near future.
For now it uses the environment variables VAE_TILE_FRAMES and VAE_TILE_PAD to control the effect. In my experience, setting VAE_TILE_FRAMES=4 and VAE_TILE_PAD=1 seems to give very decent results (see comments).

How it works:
When using temporal tiling, instead of processing latent frames sequentially, with each frame able to access only its predecessors, the latent frames are batched together in chunks/tiles of size VAE_TILE_FRAMES and decoded at once, still with access to the previous batches. The last VAE_TILE_PAD latent frames of each batch are discarded (and removed from the cache), and then re-processed as part of the next batch.

Quick comparison of compute buffer VRAM usage for 512x512x129 (17 latent frames) for various temporal tile sizes with 0 padding (padding seems to slightly decrease the compute buffer size, but might make decoding a bit longer)
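
To make the batching scheme from "How it works" more concrete, here is a rough Python sketch of how the tile boundaries could be planned. It is not the code from this PR. The helper names `plan_tiles` and `tiling_from_env` are hypothetical, the defaults used when the variables are unset are made up, and it assumes that each batch of VAE_TILE_FRAMES latent frames includes the VAE_TILE_PAD frames that get discarded (so the stride between batches is VAE_TILE_FRAMES - VAE_TILE_PAD) and that the final batch keeps all of its frames.

```python
import os


def plan_tiles(num_latent_frames, tile_frames, tile_pad):
    """Plan temporal tiles as (start, end, keep_end) index triples.

    Latent frames [start, end) are decoded together in one batch.
    Only frames [start, keep_end) are kept; frames [keep_end, end)
    are discarded and decoded again at the start of the next batch.
    Assumption: the padding is counted inside the tile size.
    """
    if tile_frames < 1:
        raise ValueError("tile_frames must be at least 1")
    if not 0 <= tile_pad < tile_frames:
        raise ValueError("tile_pad must satisfy 0 <= tile_pad < tile_frames")

    tiles = []
    start = 0
    while start < num_latent_frames:
        end = min(start + tile_frames, num_latent_frames)
        is_last = end == num_latent_frames
        # The last batch has no "future" frames to re-process, so keep it all.
        keep_end = end if is_last else end - tile_pad
        tiles.append((start, end, keep_end))
        start = keep_end
    return tiles


def tiling_from_env():
    """Read the two settings from the environment.

    The fallback values (1 frame per tile, no padding) are assumptions
    for this sketch, not the defaults used by the PR.
    """
    tile_frames = int(os.environ.get("VAE_TILE_FRAMES", "1"))
    tile_pad = int(os.environ.get("VAE_TILE_PAD", "0"))
    return tile_frames, tile_pad


if __name__ == "__main__":
    frames, pad = tiling_from_env()
    for start, end, keep_end in plan_tiles(17, frames, pad):
        print(f"decode latent frames [{start}, {end}), keep [{start}, {keep_end})")
```

Under these assumptions, 17 latent frames with VAE_TILE_FRAMES=4 and VAE_TILE_PAD=1 need six batches instead of five, since each batch except the last re-decodes one frame. That fits the observation that padding might make decoding a bit longer.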