Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Qualcomm AI Research

Sample qualitative videos generated by our hybrid attention models distilled from Wan2.1 1.3B.

Abstract

Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, which mixes softmax and linear attention tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery yields the first competitive video diffusion models with sub-quadratic attention, reducing attention cost by up to 40% in terms of FLOPs while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.

Method

Impact of the proposed method components: attention distillation and hybrid attention. Prompt: "An astronaut flying in space, Van Gogh style."

We propose an efficient attention surgery strategy — eliminating the need for extensive retraining from scratch — coupled with a novel efficient hybrid attention architecture inspired by recent developments in language modeling. Intuitively, if a small subset of tokens retains full softmax attention while the rest use linear attention, the model can preserve global structure and fine-grained dependencies where needed, while scaling efficiently elsewhere.
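To make this concrete, the sketch below shows one way such a hybrid layer could be wired up in PyTorch: a strided subset of query tokens keeps exact softmax attention over all keys, while the remaining tokens go through a kernelized linear-attention path. The strided token selection and the elu-based feature map are illustrative assumptions, not the exact formulation used in our models.

```python
# Minimal sketch of a hybrid softmax/linear attention layer. The strided
# "anchor" token selection and the (elu + 1) feature map are illustrative
# assumptions, not the paper's exact formulation. Requires PyTorch >= 2.0.
import torch
import torch.nn.functional as F


def feature_map(x):
    # Positive feature map commonly used for linear attention (assumption).
    return F.elu(x) + 1.0


def linear_attention(q, k, v, eps=1e-6):
    # Sub-quadratic attention: softmax(QK^T)V is approximated by
    # phi(Q) (phi(K)^T V), normalized by phi(Q) sum_n phi(K_n).
    q, k = feature_map(q), feature_map(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                   # (B, H, d, d_v)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)


def hybrid_attention(q, k, v, softmax_stride=4):
    """Every `softmax_stride`-th query token gets exact softmax attention over
    all keys; the remaining tokens use linear attention (illustrative split)."""
    B, H, N, _ = q.shape
    idx = torch.arange(N, device=q.device)
    sm_mask = (idx % softmax_stride == 0)

    out = linear_attention(q, k, v)                      # linear path for all tokens
    q_sm = q[:, :, sm_mask]                              # softmax path for the subset
    out_sm = F.scaled_dot_product_attention(q_sm, k, v)  # exact softmax attention
    out[:, :, sm_mask] = out_sm                          # overwrite anchor tokens
    return out


if __name__ == "__main__":
    B, H, N, D = 1, 8, 1024, 64
    q, k, v = (torch.randn(B, H, N, D) for _ in range(3))
    print(hybrid_attention(q, k, v).shape)  # torch.Size([1, 8, 1024, 64])
```

With this split, only the anchor tokens pay the quadratic softmax cost, while the bulk of the sequence scales linearly in the number of tokens.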

Our approach significantly narrows the quality gap between linearized and full softmax attention while achieving higher efficiency than the original softmax attention models. Importantly, the overall surgery requires less than 0.4k GPU hours, making it practical for a wide range of research and industrial settings. We validate our method on Wan2.1, a state-of-the-art video diffusion model, demonstrating that our contributions transfer successfully to transformer-based diffusion models.
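As a rough illustration of the distillation stage, the following sketch trains each hybrid-attention block to match the output of the corresponding frozen softmax-attention block before a short end-to-end fine-tune. Here `teacher_block` and `student_block` are hypothetical stand-ins for the Wan2.1 transformer blocks, and the plain MSE objective is an assumption rather than the exact loss we use.

```python
# Hypothetical sketch of block-wise attention distillation: the student's
# hybrid-attention block is regressed onto the frozen teacher's softmax-attention
# block output on the same inputs. Block modules and data are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distill_block(teacher_block: nn.Module, student_block: nn.Module,
                  hidden_states: torch.Tensor) -> torch.Tensor:
    """MSE between the frozen teacher block output and the trainable student
    block output on the same hidden states."""
    with torch.no_grad():
        target = teacher_block(hidden_states)          # frozen softmax-attention block
    prediction = student_block(hidden_states)          # trainable hybrid-attention block
    return F.mse_loss(prediction, target)
```

In this view, each block is distilled against its teacher counterpart first, and the whole student is then briefly fine-tuned on the usual diffusion objective.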

Major contributions:

  • We introduce attention surgery, an efficient recipe that yields competitive linear/hybrid models within only a few GPU-days of training on modestly sized datasets, making such significant architectural modifications broadly accessible.
  • We propose a novel hybrid attention formulation whose components are carefully designed around the intrinsic structure of videos.
  • We propose a novel block-rate optimization strategy that adjusts the attention configuration of each block based on its transformation complexity, achieving the best accuracy–efficiency trade-off within a given compute budget (see the sketch after this list).
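The block-rate optimization can be viewed as a budgeted selection problem. The sketch below greedily assigns each block the softmax-token rate with the best quality-gain-per-FLOP until a global attention-FLOPs budget is exhausted; the `quality_gain` proxy, the cost model, and the greedy procedure itself are hypothetical placeholders rather than our actual optimization.

```python
# Hypothetical greedy sketch of cost-aware block-rate assignment: each DiT block
# is given a softmax-token rate chosen to maximize a quality proxy per unit of
# attention FLOPs under a global budget. Scores and cost model are assumptions.
from dataclasses import dataclass


@dataclass
class BlockOption:
    block_id: int
    softmax_rate: float   # fraction of tokens kept on the softmax path
    flops: float          # attention FLOPs for this block at this rate
    quality_gain: float   # proxy for how much this rate helps this block


def assign_block_rates(options: list[BlockOption], budget: float) -> dict[int, float]:
    """Greedily pick the best quality-per-FLOP upgrades until the budget is spent."""
    chosen: dict[int, float] = {}
    spent = 0.0
    # Consider candidate upgrades in order of quality gained per FLOP.
    for opt in sorted(options, key=lambda o: o.quality_gain / o.flops, reverse=True):
        if opt.block_id in chosen or spent + opt.flops > budget:
            continue
        chosen[opt.block_id] = opt.softmax_rate
        spent += opt.flops
    return chosen  # blocks not chosen fall back to fully linear attention


if __name__ == "__main__":
    opts = [BlockOption(0, 0.25, 1.0, 0.9), BlockOption(1, 0.25, 1.0, 0.3),
            BlockOption(0, 0.50, 2.0, 1.2), BlockOption(1, 0.50, 2.0, 0.5)]
    print(assign_block_rates(opts, budget=3.0))  # e.g. {0: 0.25, 1: 0.25}
```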

Results

Total DiT FLOPs percentage versus VBench score for the original Wan2.1 1.3B model compared to various hybrid configurations, at 320x480 (left) and 480x832 (right) resolutions.

Comparisons with SOTA efficient video diffusion models. All metrics are extracted from reported numbers, except for `Wan2.1*`, which is our reproduction using the same evaluation pipeline and parameters as our variations.

BibTeX

@article{ghafoorian2025attention,
  title={Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer},
  author={Ghafoorian, Mohsen and Korzhenkov, Denis and Habibian, Amirhossein},
  journal={arXiv preprint arXiv:2509.24899},
  year={2025}
}