ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

Qualcomm AI Research

Qualitative sample videos generated by our hybrid-attention models distilled from Wan2.1 1.3B.

Abstract

Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling a chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt’s hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU-hours while remaining competitive in quality. Our lightweight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation.

Method

A comparison of our proposed Recurrent Hybrid Attention model with Wan2.1's bidirectional full softmax attention. Top: compute cost growth with video duration (left: FLOPs, right: phone latency). Bottom: comparison of our hybrid model (20x ReHyAt blocks) with the original Wan2.1 1.3B, qualitatively and quantitatively. Prompt: "A cat drinking water"

We introduce ReHyAt, a recurrent hybrid attention mechanism tailored for video diffusion. Our key insight is that preserving softmax attention for a small subset of tokens—those most critical for modeling local dependencies—while applying linear attention globally captures both long-range context and high-fidelity local dependencies at linear cost. We propose a temporally chunked hybrid attention design with overlapping chunks to maintain temporal coherence, and show that this formulation can be recast as a chunk-wise RNN with constant memory complexity. Furthermore, we leverage a two-stage training pipeline—attention distillation from a bidirectional softmax teacher followed by lightweight fine-tuning—that achieves SOTA results in fewer than 200 GPU-hours. We validate our approach by transforming Wan2.1 into its recurrent hybrid counterpart and evaluating on VBench, VBench-2.0, and a human preference study, demonstrating that ReHyAt delivers near state-of-the-art quality at dramatically reduced compute.
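
To make the idea concrete, below is a minimal sketch of temporally chunked hybrid attention. It assumes non-overlapping chunks along the (temporally ordered) token axis, an elu+1 feature map for the linear branch, and a simple additive fusion of the two branches; the function names feature_map and hybrid_attention are illustrative, and the exact ReHyAt formulation (including chunk overlap and normalization) follows the paper rather than this sketch.

    import torch
    import torch.nn.functional as F

    def feature_map(x):
        # Kernel feature map commonly used for linear attention (an assumption here).
        return F.elu(x) + 1.0

    def hybrid_attention(q, k, v, chunk_size):
        # q, k, v: (batch, tokens, dim), tokens assumed in temporal order.
        # Each chunk attends to itself with softmax attention (local fidelity)
        # and to all tokens through linear attention (global context, linear cost).
        b, n, d = q.shape

        # Global linear-attention statistics, shared by every chunk: O(n * d^2).
        phi_k = feature_map(k)                        # (b, n, d)
        kv = torch.einsum("bnd,bne->bde", phi_k, v)   # (b, d, d)
        k_sum = phi_k.sum(dim=1)                      # (b, d)

        outputs = []
        for start in range(0, n, chunk_size):
            qc = q[:, start:start + chunk_size]
            kc = k[:, start:start + chunk_size]
            vc = v[:, start:start + chunk_size]

            # Local softmax attention within the chunk: quadratic only in chunk size.
            local = F.scaled_dot_product_attention(qc, kc, vc)

            # Global linear attention against the precomputed statistics.
            phi_q = feature_map(qc)
            num = torch.einsum("bcd,bde->bce", phi_q, kv)
            den = torch.einsum("bcd,bd->bc", phi_q, k_sum)
            global_out = num / den.clamp_min(1e-6).unsqueeze(-1)

            # Additive fusion of the local and global branches (an assumption for this sketch).
            outputs.append(local + global_out)

        return torch.cat(outputs, dim=1)

With n tokens and chunk size c, the softmax branch costs O(n·c·d) in total and the linear branch O(n·d^2), so the overall cost stays linear in sequence length rather than quadratic.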

Major contributions:

  • We propose ReHyAt, a novel temporally chunked hybrid attention mechanism that combines local softmax attention with global linear attention. This design preserves high-fidelity modeling of critical dependencies within and across adjacent frames while reducing overall complexity to linear time.
  • We derive a chunk-wise recurrent reformulation of ReHyAt, enabling generation of arbitrarily long videos with constant memory usage and efficient inference (sketched below).
  • Through extensive empirical evaluations and ablation studies, we show that a state-of-the-art bidirectional softmax-attention video diffusion model can be transformed into a chunk-wise recurrent model within only a few hundred GPU-hours, with negligible impact on quality.
Overview of the temporally chunked hybrid attention arrangement without (left) and with (right) chunk overlap.
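
For intuition on the constant-memory claim, the sketch below shows a chunk-wise recurrence for the linear-attention branch: each chunk only updates and queries a fixed-size state (a d x d matrix and a d-vector), independent of how many chunks have been processed. The causal chunk ordering, elu+1 feature map, and function names are assumptions made for illustration; ReHyAt's actual recurrence, chunk overlap, and handling of the local softmax branch are as described in the paper.

    import torch
    import torch.nn.functional as F

    def feature_map(x):
        return F.elu(x) + 1.0

    def recurrent_linear_attention(chunks):
        # chunks: iterable of (q, k, v) tensors, each (batch, chunk_len, dim),
        # in temporal order. The carried state has a fixed size regardless of
        # how many chunks (i.e. how long a video) we process.
        outputs, kv, k_sum = [], None, None
        for qc, kc, vc in chunks:
            b, c, d = qc.shape
            if kv is None:
                kv = qc.new_zeros(b, d, d)    # running sum of phi(k)^T v
                k_sum = qc.new_zeros(b, d)    # running sum of phi(k)

            phi_q, phi_k = feature_map(qc), feature_map(kc)

            # Update the recurrent state with this chunk's keys and values.
            kv = kv + torch.einsum("bcd,bce->bde", phi_k, vc)
            k_sum = k_sum + phi_k.sum(dim=1)

            # Query the accumulated state: cost is linear in the chunk length.
            num = torch.einsum("bcd,bde->bce", phi_q, kv)
            den = torch.einsum("bcd,bd->bc", phi_q, k_sum)
            outputs.append(num / den.clamp_min(1e-6).unsqueeze(-1))

        return torch.cat(outputs, dim=1)

Because kv and k_sum never grow, peak memory is independent of video length; only the current chunk's activations and this small state need to be resident at any time.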

Results

Total DiT FLOPs percentage versus VBench score for the original Wan2.1 1.3B model compared with various hybrid configurations, for 320x480 (left) and 480x832 (right) resolutions.

Comparisons with SOTA efficient video diffusion models on VBench. "Wan2.1*" is our best reproduction using our evaluation pipeline.

BibTeX

@article{ghafoorian2026rehyat,
  title={ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers},
  author={Ghafoorian, Mohsen and Habibian, Amirhossein},
  journal={arXiv preprint arXiv:2601.04342},
  year={2026}
}