Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, which mixes softmax and linear tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery yields the first competitive video diffusion models with sub-quadratic attention, reducing attention cost by up to 40% in FLOPs while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.
We propose an efficient attention surgery strategy that eliminates the need for extensive retraining from scratch, coupled with a novel hybrid attention architecture inspired by recent developments in language modeling. Intuitively, if a small subset of tokens retains full softmax attention while the rest use linear attention, the model can preserve global structure and fine-grained dependencies where needed, while scaling efficiently elsewhere.
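To make this token-level split concrete, here is a minimal sketch of a hybrid attention layer, assuming the simplest possible assignment: the first num_softmax query tokens keep exact softmax attention over the full sequence, while the remaining queries use kernelized linear attention. The names (hybrid_attention, num_softmax, feature_map) and the ELU-based feature map are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of hybrid softmax/linear attention (not the paper's code).
import torch
import torch.nn.functional as F


def feature_map(x: torch.Tensor) -> torch.Tensor:
    # Non-negative feature map commonly used in linear attention.
    return F.elu(x) + 1.0


def hybrid_attention(q, k, v, num_softmax: int):
    """q, k, v: (batch, heads, seq, dim). The first `num_softmax` query tokens
    use exact softmax attention; the rest use linear (kernel) attention."""
    scale = q.shape[-1] ** -0.5

    # Softmax branch: exact attention for the selected query tokens.
    q_sm = q[:, :, :num_softmax]
    attn = torch.softmax(q_sm @ k.transpose(-2, -1) * scale, dim=-1)
    out_sm = attn @ v

    # Linear branch: O(N * d^2) attention for the remaining query tokens.
    q_lin, k_lin = feature_map(q[:, :, num_softmax:]), feature_map(k)
    kv = k_lin.transpose(-2, -1) @ v                                  # (b, h, d, d)
    z = q_lin @ k_lin.sum(dim=2, keepdim=True).transpose(-2, -1)      # normalizer
    out_lin = (q_lin @ kv) / (z + 1e-6)

    return torch.cat([out_sm, out_lin], dim=2)


if __name__ == "__main__":
    b, h, n, d = 1, 8, 256, 64
    q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
    out = hybrid_attention(q, k, v, num_softmax=32)
    print(out.shape)  # torch.Size([1, 8, 256, 64])
```

In this toy version the softmax branch still costs O(num_softmax * N), so the overall cost stays sub-quadratic as long as the softmax subset is small relative to the sequence length; how tokens are selected and how the two branches are mixed per block is where the actual method differs from this sketch.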
Our approach significantly narrows the quality gap between linearized and full softmax attention while achieving higher efficiency than the original softmax-attention model. Importantly, the entire surgery requires modest compute, under 0.4k GPU-hours, making it practical for a wide range of research and industrial settings. We validate our method on Wan2.1, a state-of-the-art video diffusion model, demonstrating that the recipe applies directly to modern transformer-based (DiT) diffusion models.
Major contributions:
- A hybrid attention mechanism that mixes softmax and linear tokens, narrowing the quality gap to full softmax attention.
- A lightweight distillation and fine-tuning pipeline that converts pretrained VDMs in only a few GPU-days, without training from scratch.
- A cost-aware block-rate strategy that balances expressiveness and efficiency across layers.
- The first competitive sub-quadratic attention video diffusion models, built on Wan2.1 1.3B, reducing attention FLOPs by up to 40% while maintaining quality on VBench and VBench-2.0.
@article{ghafoorian2025attention,
title={Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer},
author={Ghafoorian, Mohsen and Korzhenkov, Denis and Habibian, Amirhossein},
journal={arXiv preprint arXiv:2509.24899},
year={2025}
}