Qualitative samples from the final model recipe (PyramidalWan-DMD-PT*).
Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at different resolutions: inputs with higher noise levels are handled at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the inference cost of multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform state-of-the-art systems in visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, without degrading output video quality. Furthermore, we investigate and compare various step distillation strategies within pyramidal models to further improve inference efficiency.
We start from the pretrained Wan2.1-1.3B model and restructure its diffusion process into three spatiotemporal stages at resolutions 81×448×832, 41×224×416, and 21×112×208 (see figure above).
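To make the staging concrete, below is a minimal sketch of a coarse-to-fine sampling loop over these three stages. The `denoise_step` stub, the step counts, and the renoising coefficient are hypothetical stand-ins, not the actual PyramidalWan implementation.

```python
# Minimal sketch of three-stage coarse-to-fine sampling (hypothetical
# stand-ins, not the actual PyramidalWan code). Shapes are (C, T, H, W).
import torch
import torch.nn.functional as F

# Stage resolutions from the paper, coarsest to finest: (frames, height, width).
STAGES = [(21, 112, 208), (41, 224, 416), (81, 448, 832)]

def denoise_step(x, stage, step):
    """Placeholder for one flow matching update x <- x + v(x, t) * dt;
    a real model would predict the velocity v, here x is returned as-is."""
    return x

def sample(channels=3, steps_per_stage=(10, 10, 10), renoise=0.7):
    frames, height, width = STAGES[0]
    x = torch.randn(channels, frames, height, width)  # pure noise, coarsest stage
    for i in range(len(STAGES)):
        for k in range(steps_per_stage[i]):           # denoise within the stage
            x = denoise_step(x, stage=i, step=k)
        if i + 1 < len(STAGES):
            nf, nh, nw = STAGES[i + 1]
            # Nearest-neighbor upsampling to the next stage's resolution,
            # then renoising so the input matches that stage's noise level
            # (the renoise coefficient is illustrative, not the paper's schedule).
            x = F.interpolate(x[None], size=(nf, nh, nw), mode="nearest")[0]
            x = renoise * x + (1 - renoise) * torch.randn_like(x)
    return x

print(sample().shape)  # torch.Size([3, 81, 448, 832])
```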
The model is finetuned using the pyramidal flow matching loss, achieving substantial inference cost reduction with near-original quality.
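For intuition, here is a hedged sketch of what a per-stage flow matching objective can look like. The stage time windows, the interpolation path, and the `model` signature are illustrative assumptions, not the exact loss used in the paper.

```python
# Hedged sketch of a per-stage flow matching loss (illustrative, not the
# paper's exact objective).
import torch
import torch.nn.functional as F

def pyramidal_fm_loss(model, x1, stage, num_stages=3):
    """x1: clean latent already downsampled to this stage's resolution."""
    b = x1.shape[0]
    # Stage k covers the unit-time window [k/num_stages, (k+1)/num_stages).
    t0, t1 = stage / num_stages, (stage + 1) / num_stages
    t = t0 + (t1 - t0) * torch.rand(b, device=x1.device)
    # For finer stages, x0 would be a renoised upsampling of the coarser
    # stage's output rather than pure noise; pure noise is used here for brevity.
    x0 = torch.randn_like(x1)
    tb = t.view(-1, *([1] * (x1.dim() - 1)))
    xt = (1 - tb) * x0 + tb * x1              # linear interpolation path
    v_target = x1 - x0                        # flow matching velocity target
    return F.mse_loss(model(xt, t), v_target)

# Toy usage with a stand-in model and small latents.
model = lambda x, t: torch.zeros_like(x)
x1 = torch.randn(2, 16, 5, 28, 52)
print(pyramidal_fm_loss(model, x1, stage=0).item())
```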
We further analyze step distillation strategies within the pyramidal setup, using both conventional and pyramidal teachers, and show that both settings can achieve strong results.
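As one concrete example of such a distillation objective, below is a minimal sketch of a DMD-style generator update, which nudges generated samples along the difference between the "fake" and "real" score estimates. The normalization, noise perturbation, and stub score functions are our illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a DMD-style generator loss (illustrative assumptions,
# not the paper's implementation).
import torch

def dmd_generator_loss(x_gen, real_score, fake_score, t=0.5, sigma=0.8):
    # Perturb the generated sample to the diffusion time being matched.
    x_noisy = x_gen + sigma * torch.randn_like(x_gen)
    with torch.no_grad():
        # KL gradient direction: difference of fake and real score estimates.
        grad = fake_score(x_noisy, t) - real_score(x_noisy, t)
        grad = grad / (grad.abs().mean() + 1e-8)  # normalization (illustrative)
    # Surrogate whose gradient w.r.t. x_gen equals `grad` (up to the mean
    # reduction); `grad` itself carries no autograd history.
    return (grad * x_gen).mean()

# Toy usage with stand-in score functions.
real_score = lambda x, t: -x
fake_score = lambda x, t: -0.5 * x
x_gen = torch.randn(2, 16, 5, 28, 52, requires_grad=True)
loss = dmd_generator_loss(x_gen, real_score, fake_score)
loss.backward()
print(x_gen.grad.abs().mean().item())
```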
We also demonstrate for the first time that recently proposed Pyramidal Patchification models (an alternative to Pyramidal Flow) can be successfully trained for few-step video generation.
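To illustrate the idea behind Pyramidal Patchification, here is a hedged sketch of a DiT-style patch embedding whose patch size grows with the noise level, so noisier steps process fewer tokens. The patch-size schedule, module names, and per-size projections are hypothetical.

```python
# Hedged sketch of pyramidal patchification: coarser patches (fewer tokens)
# at higher noise levels. Schedule and module design are illustrative.
import torch
import torch.nn as nn

class VariablePatchEmbed(nn.Module):
    def __init__(self, channels=16, dim=256, patch_sizes=(1, 2, 4)):
        super().__init__()
        # One projection per patch size; weights could also be shared/resized.
        self.projs = nn.ModuleDict({
            str(p): nn.Conv2d(channels, dim, kernel_size=p, stride=p)
            for p in patch_sizes
        })
        self.patch_sizes = patch_sizes

    def forward(self, x, t):
        # Noisier inputs (t close to 1) get a larger patch size -> fewer tokens.
        idx = min(int(t * len(self.patch_sizes)), len(self.patch_sizes) - 1)
        p = self.patch_sizes[idx]
        tokens = self.projs[str(p)](x)                # (B, dim, H/p, W/p)
        return tokens.flatten(2).transpose(1, 2)      # (B, N, dim)

embed = VariablePatchEmbed()
x = torch.randn(1, 16, 56, 104)
print(embed(x, t=0.9).shape[1], embed(x, t=0.1).shape[1])  # 364 vs. 5824 tokens
```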
In addition to this empirical study, we present a theoretical generalization of the resolution transition operations introduced in Pyramidal Flow. Specifically, we extend these operations to arbitrary upsampling and downsampling functions based on orthogonal transforms. Notably, average pooling and nearest-neighbor upsampling, employed in the original work, can be interpreted as scaled instances of the Haar wavelet operator, fitting within our generalized framework.
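This Haar correspondence is easy to verify numerically. The following snippet (our check, not code from the paper) builds the 1D operators and confirms the scaling relations.

```python
# Numerical check (ours, not from the paper): 1D average pooling and
# nearest-neighbor upsampling are scaled Haar low-pass analysis/synthesis.
import numpy as np

n = 8
# Haar analysis low-pass: rows [.., 1/sqrt(2), 1/sqrt(2), ..] (orthonormal).
H = np.zeros((n // 2, n))
for i in range(n // 2):
    H[i, 2 * i: 2 * i + 2] = 1 / np.sqrt(2)

avg_pool = np.zeros((n // 2, n))                # rows [.., 1/2, 1/2, ..]
for i in range(n // 2):
    avg_pool[i, 2 * i: 2 * i + 2] = 0.5

nn_up = np.repeat(np.eye(n // 2), 2, axis=0)    # duplicates each entry

# Average pooling is the Haar low-pass scaled by 1/sqrt(2); nearest-neighbor
# upsampling is the Haar synthesis (the transpose, by orthogonality) scaled
# by sqrt(2).
assert np.allclose(avg_pool, H / np.sqrt(2))
assert np.allclose(nn_up, np.sqrt(2) * H.T)
# Consequently Down(Up(x)) = x: the pair forms a (scaled) orthogonal projection.
x = np.random.randn(n // 2)
assert np.allclose(avg_pool @ (nn_up @ x), x)
print("ok")
```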
In summary, our contributions are as follows:
- A low-cost finetuning pipeline that converts a pretrained video diffusion model (Wan2.1-1.3B) into a pyramidal one without degrading output quality.
- An empirical comparison of step distillation strategies in the pyramidal setup, with both conventional and pyramidal teachers.
- The first demonstration that Pyramidal Patchification models can be trained for few-step video generation.
- A theoretical generalization of the Pyramidal Flow resolution transitions to arbitrary up- and downsampling based on orthogonal transforms, with average pooling and nearest-neighbor upsampling as scaled Haar instances.
Qualitative results for the main baselines and the final model configuration (PyramidalWan-DMD-PT*). The first two rows show the original Wan model with 50 and 25 steps. The third row presents PyramidalWan (20-20-10 steps), followed by Wan-DMD (2 steps) and our final recipe, PyramidalWan-DMD-PT*. As illustrated, the proposed approach achieves video quality comparable to the baselines while requiring significantly less compute.
Prompts: "A person is squat.", "A horse.", "A bigfoot walking in the snowstorm.", "A pink bird.", "A jellyfish floating through the ocean...", "Yoda playing guitar on the stage."
Qualitative results for step distillation (Adversarial and DMD) applied to the original Wan model with varying numbers of sampling steps. Videos generated with 4 and 2 steps maintain good quality, whereas performance degrades noticeably at just 1 step.
Prompts: the same six prompts as above.
Qualitative results for step distillation (Adversarial and DMD) applied to the Pyramidal Wan model. Adversarial distillation tends to produce more realistic-looking videos, whereas DMD outputs often appear more stylized or cartoon-like. However, DMD achieves better and more consistent motion, while adversarial methods suffer from motion artifacts. We note that PyramidalWan-DMD-OT, which closely follows one of the concurrent works, tends to produce videos with oversaturated colors. Overall, the highest quality is achieved by PyramidalWan-DMD-PT*, which we recommend as the final model recipe.
Prompts: the same six prompts as above.
Qualitative results for the Wan-DMD model with 2 steps, shown with and without the training-free acceleration method Jenga, alongside the final model recipe. Although Jenga-augmented models perform well on automated metrics, qualitative results reveal severe scene and motion artifacts.
Prompts: the same six prompts as above.
@article{korzhenkov2026pyramidalwan,
  title={PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference},
  author={Korzhenkov, Denis and Karjauv, Adil and Karnewar, Animesh and Ghafoorian, Mohsen and Habibian, Amirhossein},
  journal={arXiv preprint arXiv:2601.04792},
  year={2026}
}