Qualitative samples from the final model recipe (PyramidalWan-DMD-PT*).
Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at different resolutions: inputs with higher noise levels are handled at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the inference cost of multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform state-of-the-art systems in visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, without degrading output video quality. Furthermore, we investigate and compare various step distillation strategies within pyramidal models to further improve inference efficiency.
We start from the pretrained Wan2.1-1.3B model and restructure its diffusion process into three spatiotemporal stages at resolutions 81×448×832, 41×224×416, and 21×112×208 (see figure above).
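To make the staging concrete, below is a minimal sketch of a coarse-to-fine sampling loop over these three stages. The `denoise_step` stub, the step counts, and the renoising coefficient are hypothetical stand-ins, not the actual PyramidalWan implementation.

```python
# Minimal sketch of three-stage coarse-to-fine sampling (hypothetical
# stand-ins, not the actual PyramidalWan code). Shapes are (C, T, H, W).
import torch
import torch.nn.functional as F

# Stage resolutions from the paper, coarsest to finest: (frames, height, width).
STAGES = [(21, 112, 208), (41, 224, 416), (81, 448, 832)]

def denoise_step(x, stage, step):
    """Placeholder for one flow matching update x <- x + v(x, t) * dt;
    a real model would predict the velocity v, here x is returned as-is."""
    return x

def sample(channels=3, steps_per_stage=(10, 10, 10), renoise=0.7):
    frames, height, width = STAGES[0]
    x = torch.randn(channels, frames, height, width)  # pure noise, coarsest stage
    for i in range(len(STAGES)):
        for k in range(steps_per_stage[i]):           # denoise within the stage
            x = denoise_step(x, stage=i, step=k)
        if i + 1 < len(STAGES):
            nf, nh, nw = STAGES[i + 1]
            # Nearest-neighbor upsampling to the next stage's resolution,
            # then renoising so the input matches that stage's noise level
            # (the renoise coefficient is illustrative, not the paper's schedule).
            x = F.interpolate(x[None], size=(nf, nh, nw), mode="nearest")[0]
            x = renoise * x + (1 - renoise) * torch.randn_like(x)
    return x

print(sample().shape)  # torch.Size([3, 81, 448, 832])
```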
The model is finetuned using the pyramidal flow matching loss, achieving substantial inference cost reduction with near-original quality.
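For intuition, here is a hedged sketch of what a per-stage flow matching objective can look like. The stage time windows, the interpolation path, and the `model` signature are illustrative assumptions, not the exact loss used in the paper.

```python
# Hedged sketch of a per-stage flow matching loss (illustrative, not the
# paper's exact objective).
import torch
import torch.nn.functional as F

def pyramidal_fm_loss(model, x1, stage, num_stages=3):
    """x1: clean latent already downsampled to this stage's resolution."""
    b = x1.shape[0]
    # Stage k covers the unit-time window [k/num_stages, (k+1)/num_stages).
    t0, t1 = stage / num_stages, (stage + 1) / num_stages
    t = t0 + (t1 - t0) * torch.rand(b, device=x1.device)
    # For finer stages, x0 would be a renoised upsampling of the coarser
    # stage's output rather than pure noise; pure noise is used here for brevity.
    x0 = torch.randn_like(x1)
    tb = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - tb) * x0 + tb * x1              # linear interpolation path
    v_target = x1 - x0                        # flow matching velocity target
    return F.mse_loss(model(xt, t), v_target)

# Toy usage with a stand-in model and small latents.
model = lambda x, t: torch.zeros_like(x)
x1 = torch.randn(2, 16, 5, 28, 52)
print(pyramidal_fm_loss(model, x1, stage=0).item())
```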
We further analyze step distillation strategies within the pyramidal setup, using both conventional and pyramidal teachers, and show that both settings can achieve strong results.
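As one concrete example of such a distillation objective, below is a minimal sketch of a DMD-style generator update, which nudges generated samples along the difference between the "fake" and "real" score estimates. The normalization, noise perturbation, and stub score functions are our illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a DMD-style generator loss (illustrative assumptions,
# not the paper's implementation).
import torch

def dmd_generator_loss(x_gen, real_score, fake_score, t=0.5, sigma=0.8):
    # Perturb the generated sample to the diffusion time being matched.
    x_noisy = x_gen + sigma * torch.randn_like(x_gen)
    with torch.no_grad():
        # KL gradient direction: difference of fake and real score estimates.
        grad = fake_score(x_noisy, t) - real_score(x_noisy, t)
        grad = grad / (grad.abs().mean() + 1e-8)  # normalization (illustrative)
    # Surrogate whose gradient w.r.t. x_gen equals `grad` (up to the mean
    # reduction); `grad` itself carries no autograd history.
    return (grad * x_gen).mean()

# Toy usage with stand-in score functions.
real_score = lambda x, t: -x
fake_score = lambda x, t: -0.5 * x
x_gen = torch.randn(2, 16, 5, 28, 52, requires_grad=True)
loss = dmd_generator_loss(x_gen, real_score, fake_score)
loss.backward()
print(x_gen.grad.abs().mean().item())
```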
We also demonstrate for the first time that recently proposed Pyramidal Patchification models (an alternative to Pyramidal Flow) can be successfully trained for few-step video generation.
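To illustrate the idea behind Pyramidal Patchification, here is a hedged sketch of a DiT-style patch embedding whose patch size grows with the noise level, so noisier steps process fewer tokens. The patch-size schedule, module names, and per-size projections are hypothetical.

```python
# Hedged sketch of pyramidal patchification: coarser patches (fewer tokens)
# at higher noise levels. Schedule and module design are illustrative.
import torch
import torch.nn as nn

class VariablePatchEmbed(nn.Module):
    def __init__(self, channels=16, dim=256, patch_sizes=(1, 2, 4)):
        super().__init__()
        # One projection per patch size; weights could also be shared/resized.
        self.projs = nn.ModuleDict({
            str(p): nn.Conv2d(channels, dim, kernel_size=p, stride=p)
            for p in patch_sizes
        })
        self.patch_sizes = patch_sizes

    def forward(self, x, t):
        # Noisier inputs (t close to 1) get a larger patch size -> fewer tokens.
        idx = min(int(t * len(self.patch_sizes)), len(self.patch_sizes) - 1)
        p = self.patch_sizes[idx]
        tokens = self.projs[str(p)](x)                # (B, dim, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)      # (B, N, dim)

embed = VariablePatchEmbed()
x = torch.randn(1, 16, 56, 104)
print(embed(x, t=0.9).shape[1], embed(x, t=0.1).shape[1])  # 364 vs. 5824 tokens
```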
In addition to this empirical study, we present a theoretical generalization of the resolution transition operations introduced in Pyramidal Flow. Specifically, we extend these operations to arbitrary upsampling and downsampling functions based on orthogonal transforms. Notably, average pooling and nearest-neighbor upsampling, employed in the original work, can be interpreted as scaled instances of the Haar wavelet operator, fitting within our generalized framework.
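This Haar correspondence is easy to verify numerically. The following snippet (our check, not code from the paper) builds the 1D operators and confirms the scaling relations.

```python
# Numerical check (ours, not from the paper): 1D average pooling and
# nearest-neighbor upsampling are scaled Haar low-pass analysis/synthesis.
import numpy as np

n = 8
# Haar analysis low-pass: rows [.., 1/sqrt(2), 1/sqrt(2), ..] (orthonormal).
H = np.zeros((n // 2, n))
for i in range(n // 2):
    H[i, 2 * i: 2 * i + 2] = 1 / np.sqrt(2)

avg_pool = np.zeros((n // 2, n))                # rows [.., 1/2, 1/2, ..]
for i in range(n // 2):
    avg_pool[i, 2 * i: 2 * i + 2] = 0.5

nn_up = np.repeat(np.eye(n // 2), 2, axis=0)    # duplicates each entry

# Average pooling is the Haar low-pass scaled by 1/sqrt(2); nearest-neighbor
# upsampling is the Haar synthesis (the transpose, by orthogonality) scaled
# by sqrt(2).
assert np.allclose(avg_pool, H / np.sqrt(2))
assert np.allclose(nn_up, np.sqrt(2) * H.T)
# Consequently Down(Up(x)) = x: the pair forms a (scaled) orthogonal projection.
x = np.random.randn(n // 2)
assert np.allclose(avg_pool @ (nn_up @ x), x)
print("ok")
```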
In summary, our contributions are as follows:
- A low-cost finetuning pipeline that converts a pretrained video diffusion model (Wan2.1-1.3B) into a pyramidal one without degrading output quality.
- An empirical comparison of step distillation strategies in the pyramidal setup, with both conventional and pyramidal teachers.
- The first demonstration that Pyramidal Patchification models can be trained for few-step video generation.
- A theoretical generalization of the Pyramidal Flow resolution transitions to arbitrary up- and downsampling based on orthogonal transforms, with average pooling and nearest-neighbor upsampling as scaled Haar instances.
Qualitative results for the main baselines and the final model configuration (PyramidalWan-DMD-PT*). The first two rows show the original Wan model with 50 and 25 steps. The third row presents PyramidalWan (20-20-10 steps), followed by Wan-DMD (2 steps) and our final recipe, PyramidalWan-DMD-PT*. As illustrated, the proposed approach achieves video quality comparable to the baselines while requiring significantly less compute.
Prompts: "A person is squat.", "A horse.", "A bigfoot walking in the snowstorm.", "A pink bird.", "A jellyfish floating through the ocean...", "Yoda playing guitar on the stage."
Qualitative results for step distillation (Adversarial and DMD) applied to the original Wan model with varying numbers of sampling steps. Videos generated with 4 and 2 steps maintain good quality, whereas performance degrades noticeably at just 1 step.
Prompts: the same six prompts as above.
Qualitative results for step distillation (Adversarial and DMD) applied to the Pyramidal Wan model. Adversarial distillation tends to produce more realistic-looking videos, whereas DMD outputs often appear more stylized or cartoon-like. However, DMD achieves better and more consistent motion, while adversarial methods suffer from motion artifacts. We note that PyramidalWan-DMD-OT, which closely follows one of the concurrent works, tends to produce videos with oversaturated colors. Overall, the highest quality is achieved by PyramidalWan-DMD-PT*, which we recommend as the final model recipe.
Prompts: the same six prompts as above.
Qualitative results for the Wan-DMD model with 2 steps, shown with and without the training-free acceleration method Jenga, alongside the final model recipe. Although Jenga-augmented models perform well on automated metrics, qualitative results reveal severe scene and motion artifacts.
Prompts: the same six prompts as above.
@article{korzhenkov2026pyramidalwan,
  title={PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference},
  author={Korzhenkov, Denis and Karjauv, Adil and Karnewar, Animesh and Ghafoorian, Mohsen and Habibian, Amirhossein},
  journal={arXiv preprint arXiv:2601.04792},
  year={2026}
}