Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer

Qualcomm AI Research

Sample qualitative videos generated by our hybrid attention models distilled from Wan2.1 1.3B.

Abstract

Transformer-based video diffusion models (VDMs) deliver state-of-the-art video generation quality but are constrained by the quadratic cost of self-attention, making long sequences and high resolutions computationally expensive. While linear attention offers sub-quadratic complexity, prior attempts fail to match the expressiveness of softmax attention without costly retraining. We introduce Attention Surgery, an efficient framework for linearizing or hybridizing attention in pretrained VDMs without training from scratch. Inspired by recent advances in language models, our method combines a novel hybrid attention mechanism, which mixes softmax and linear attention tokens, with a lightweight distillation and fine-tuning pipeline requiring only a few GPU-days. Additionally, we incorporate a cost-aware block-rate strategy to balance expressiveness and efficiency across layers. Applied to Wan2.1 1.3B, a state-of-the-art DiT-based VDM, Attention Surgery yields the first competitive video diffusion models with sub-quadratic attention, reducing attention cost by up to 40% in terms of FLOPs while maintaining generation quality as measured on the standard VBench and VBench-2.0 benchmarks.

Method

Impact of the proposed method components: attention distillation and hybrid attention. Prompt: "An astronaut flying in space, Van Gogh style."

We propose an efficient attention surgery strategy — eliminating the need for extensive retraining from scratch — coupled with a novel efficient hybrid attention architecture inspired by recent developments in language modeling. Intuitively, if a small subset of tokens retains full softmax attention while the rest use linear attention, the model can preserve global structure and fine-grained dependencies where needed, while scaling efficiently elsewhere.
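To make this concrete, the sketch below shows one way such a hybrid layer could be wired up in PyTorch: a strided subset of query tokens keeps exact softmax attention over all keys, while the remaining tokens go through a kernelized linear-attention path. The strided token selection and the elu-based feature map are illustrative assumptions, not the exact formulation used in our models.

```python
# Minimal sketch of a hybrid softmax/linear attention layer. The strided
# "anchor" token selection and the (elu + 1) feature map are illustrative
# assumptions, not the paper's exact formulation. Requires PyTorch >= 2.0.
import torch
import torch.nn.functional as F


def feature_map(x):
    # Positive feature map commonly used for linear attention (assumption).
    return F.elu(x) + 1.0


def linear_attention(q, k, v, eps=1e-6):
    # Sub-quadratic attention: softmax(QK^T)V is approximated by
    # phi(Q) (phi(K)^T V), normalized by phi(Q) sum_n phi(K_n).
    q, k = feature_map(q), feature_map(k)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)                   # (B, H, d, d_v)
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)


def hybrid_attention(q, k, v, softmax_stride=4):
    """Every `softmax_stride`-th query token gets exact softmax attention over
    all keys; the remaining tokens use linear attention (illustrative split)."""
    B, H, N, _ = q.shape
    idx = torch.arange(N, device=q.device)
    sm_mask = (idx % softmax_stride == 0)

    out = linear_attention(q, k, v)                      # linear path for all tokens
    q_sm = q[:, :, sm_mask]                              # softmax path for the subset
    out_sm = F.scaled_dot_product_attention(q_sm, k, v)  # exact softmax attention
    out[:, :, sm_mask] = out_sm                          # overwrite anchor tokens
    return out


if __name__ == "__main__":
    B, H, N, D = 1, 8, 1024, 64
    q, k, v = (torch.randn(B, H, N, D) for _ in range(3))
    print(hybrid_attention(q, k, v).shape)  # torch.Size([1, 8, 1024, 64])
```

With this split, only the anchor tokens pay the quadratic softmax cost, while the bulk of the sequence scales linearly in the number of tokens.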

Our approach significantly narrows the quality gap between linearized and full softmax attention while achieving higher efficiency than the original softmax attention models. Importantly, the overall surgery requires less than 0.4k GPU hours, making it practical for a wide range of research and industrial settings. We validate our method on Wan2.1, a state-of-the-art video diffusion model, demonstrating that our contributions transfer successfully to transformer-based diffusion models.
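As a rough illustration of the distillation stage, the following sketch trains each hybrid-attention block to match the output of the corresponding frozen softmax-attention block before a short end-to-end fine-tune. Here `teacher_block` and `student_block` are hypothetical stand-ins for the Wan2.1 transformer blocks, and the plain MSE objective is an assumption rather than the exact loss we use.

```python
# Hypothetical sketch of block-wise attention distillation: the student's
# hybrid-attention block is regressed onto the frozen teacher's softmax-attention
# block output on the same inputs. Block modules and data are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


def distill_block(teacher_block: nn.Module, student_block: nn.Module,
                  hidden_states: torch.Tensor) -> torch.Tensor:
    """MSE between the frozen teacher block output and the trainable student
    block output on the same hidden states."""
    with torch.no_grad():
        target = teacher_block(hidden_states)          # frozen softmax-attention block
    prediction = student_block(hidden_states)          # trainable hybrid-attention block
    return F.mse_loss(prediction, target)
```

In this view, each block is distilled against its teacher counterpart first, and the whole student is then briefly fine-tuned on the usual diffusion objective.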

Major contributions:

  • We introduce attention surgery, an efficient recipe that yields competitive linear/hybrid models within only a few GPU-days of training on modestly sized datasets, making such significant architectural modifications broadly accessible.
  • We propose a novel hybrid attention formulation whose components are carefully designed around the intrinsic structure of videos.
  • We propose a novel block-rate optimization strategy that adjusts the attention configuration of each block based on its transformation complexity, achieving the best accuracy–efficiency trade-off within a given compute budget (see the sketch after this list).
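The block-rate optimization can be viewed as a budgeted selection problem. The sketch below greedily assigns each block the softmax-token rate with the best quality-gain-per-FLOP until a global attention-FLOPs budget is exhausted; the `quality_gain` proxy, the cost model, and the greedy procedure itself are hypothetical placeholders rather than our actual optimization.

```python
# Hypothetical greedy sketch of cost-aware block-rate assignment: each DiT block
# is given a softmax-token rate chosen to maximize a quality proxy per unit of
# attention FLOPs under a global budget. Scores and cost model are assumptions.
from dataclasses import dataclass


@dataclass
class BlockOption:
    block_id: int
    softmax_rate: float   # fraction of tokens kept on the softmax path
    flops: float          # attention FLOPs for this block at this rate
    quality_gain: float   # proxy for how much this rate helps this block


def assign_block_rates(options: list[BlockOption], budget: float) -> dict[int, float]:
    """Greedily pick the best quality-per-FLOP upgrades until the budget is spent."""
    chosen: dict[int, float] = {}
    spent = 0.0
    # Consider candidate upgrades in order of quality gained per FLOP.
    for opt in sorted(options, key=lambda o: o.quality_gain / o.flops, reverse=True):
        if opt.block_id in chosen or spent + opt.flops > budget:
            continue
        chosen[opt.block_id] = opt.softmax_rate
        spent += opt.flops
    return chosen  # blocks not chosen fall back to fully linear attention


if __name__ == "__main__":
    opts = [BlockOption(0, 0.25, 1.0, 0.9), BlockOption(1, 0.25, 1.0, 0.3),
            BlockOption(0, 0.50, 2.0, 1.2), BlockOption(1, 0.50, 2.0, 0.5)]
    print(assign_block_rates(opts, budget=3.0))  # e.g. {0: 0.25, 1: 0.25}
```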

Results

Total DiT FLOPs percentage versus VBench score for the original Wan2.1 1.3B model compared to various hybrid configurations, at 320x480 (left) and 480x832 (right) resolutions.

Comparisons with SOTA efficient video diffusion models. All metrics are extracted from reported numbers, except for `Wan2.1*`, which is our reproduction using the same evaluation pipeline and parameters as our variations.

BibTeX

@article{ghafoorian2025attention,
  title={Attention Surgery: An Efficient Recipe to Linearize Your Video Diffusion Transformer},
  author={Ghafoorian, Mohsen and Korzhenkov, Denis and Habibian, Amirhossein},
  journal={arXiv preprint arXiv:2509.24899},
  year={2025}
}