Abstract

We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @ 24 fps) videos at a resolution of \(\texttt{[640×1024]}\) directly on a Qualcomm Hexagon NPU in a record ~6.7s (7 FPS). Unlike existing transformer-based text-to-video models designed for offline generation, Neodragon is the first to be optimised specifically for mobile hardware, achieving efficient, low-cost, and high-fidelity video synthesis. Our main contributions are:

  • Replacing the original large 4.762B \(\mathit{T5}_\text{XXL}\) Text-Encoder with a much smaller 0.2B \(\mathit{DT5}\) (DistilT5) at minimal quality loss, enabling the entire model to run without CPU offloading. This is enabled by a novel Text-Encoder Distillation procedure which uses only generative text-prompt data and does not require any image or video data (see the sketch after this list).
  • Proposing an Asymmetric Decoder Distillation approach that allows us to replace the native codec-latent-VAE decoder with a more efficient one without disturbing the generative latent space of the video generation pipeline.
  • Pruning MMDiT blocks within the denoiser backbone based on their relative importance, and recovering the original performance through a two-stage distillation process.
  • Reducing the NFE (number of function evaluations) requirement of the denoiser by performing step distillation with a technique adapted from DMD for pyramidal flow-matching, thereby significantly accelerating video generation.
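
For illustration, here is a minimal sketch of the prompt-only text-encoder distillation idea from the first bullet. The callables `teacher`, `student`, `proj`, and `tokenize` are placeholders for the frozen T5-XXL encoder, the trainable DistilT5 student, a learned projection to the teacher's embedding width, and a shared tokenizer; the MSE objective is our own assumption, not necessarily the exact Neodragon recipe.

```python
# Hedged sketch: distill a small text encoder on prompts only (no image/video data).
import torch
import torch.nn.functional as F

def distill_step(teacher, student, proj, tokenize, prompts, optimizer):
    tokens = tokenize(prompts)             # (B, L) token ids for a batch of text prompts
    with torch.no_grad():
        t_emb = teacher(tokens)            # (B, L, d_teacher), teacher is frozen
    s_emb = student(tokens)                # (B, L, d_student), student is trainable
    loss = F.mse_loss(proj(s_emb), t_emb)  # match the teacher's token embeddings (assumed loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```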

When paired with an optimised SSD1B first-frame image generator and QuickSRNet for \(2\!\times\!\) super-resolution, our end-to-end Neodragon system is highly efficient in parameters (4.945B for the full model), memory (3.5GB peak RAM usage), and runtime (6.7s E2E latency), making it a mobile-friendly model that achieves a VBench total score of 81.61 and yields high-fidelity generated videos.
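
To make the overall composition concrete, the following sketch strings the stages together in the order described above; the function signatures and callables are illustrative placeholders under our own assumptions, not the released API.

```python
# Hedged sketch of the end-to-end flow: first-frame generation (SSD1B), few-step
# pyramidal denoising with the pruned MMDiT conditioned on DistilT5 embeddings,
# latent decoding with the distilled decoder, and 2x super-resolution (QuickSRNet).
import torch

@torch.no_grad()
def generate_video(prompt,
                   text_encoder,    # DistilT5 student (~0.2B parameters)
                   first_frame_gen, # SSD1B first-frame image generator
                   denoiser,        # pruned MMDiT, few NFEs after step distillation
                   decoder,         # distilled codec-latent decoder
                   upscaler):       # QuickSRNet 2x super-resolution
    text_emb = text_encoder(prompt)            # prompt conditioning
    first_frame = first_frame_gen(text_emb)    # 3 x 320 x 512 anchor frame (assumed base res)
    latents = denoiser(text_emb, first_frame)  # few-step pyramidal flow-matching sampling
    frames = decoder(latents)                  # 49 x 3 x 320 x 512 video at native resolution
    return upscaler(frames)                    # 49 x 3 x 640 x 1024 final clip
```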

By enabling low-cost, private, and on-device text-to-video synthesis, Neodragon democratizes AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services. Code and model will be made publicly available at our website.

Qualitative Video Results

Please refer to the arXiv Technical Report for details about our proposed optimisations and the technical details of our End-to-End Mobile Video Generation pipeline. In the next few sections, we present the qualitative results of our proposed optimisations in video format (since the technical report only contains static images) in order to demonstrate the overall look and feel and the spatio-temporal consistency of the generated videos, and thereby the effectiveness of our proposed optimisations leading to the final Neodragon Mobile VDM.

Text-Encoder Distillation Qualitative Results

[RM]: Replace-Mode
[EM]: Extend-Mode
[LORA]: LoRA-Mode
[TDT5]: Trainable DistilT5

Video results for the qualitative evaluation of our proposed Text-Encoder Distillation approach. Note that the SSD-1B first-frame generator is not used for these results, and the generated videos are at the native (non-super-resolved) resolution of [49 x 320 x 512].

Asymmetric Decoder Distillation Qualitative Results

Pyramidal-Flow Native decoder
Our Modified TinyAEHV decoder
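
As a rough illustration of the Asymmetric Decoder Distillation compared here, the sketch below keeps the generative latent space untouched and trains only a compact decoder to reproduce the frozen native decoder's reconstructions. `native_decoder` and `tiny_decoder` stand for the Pyramidal-Flow decoder and our TinyAEHV-style student; the L1 objective is an assumed choice, not necessarily the one used in Neodragon.

```python
# Hedged sketch: distill a compact decoder against the frozen native codec-latent decoder.
import torch
import torch.nn.functional as F

def decoder_distill_step(native_decoder, tiny_decoder, latents, optimizer):
    with torch.no_grad():
        target = native_decoder(latents)  # reference frames from the frozen native decoder
    pred = tiny_decoder(latents)          # student decodes the *same* codec latents
    loss = F.l1_loss(pred, target)        # pixel-space matching leaves the latent space fixed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```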

MMDiT Block Pruning Qualitative Results

24-Blocks: Baseline
22-Blocks
20-Blocks
18-Blocks
16-Blocks

Video results for the qualitative evaluation of Stage-1 finetuning after Block Pruning. These results show the effectiveness of our block-selection approach: although the technique is simple, it can be applied very effectively to control the trade-off between quality and model size for the MMDiT denoiser.
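
To make the block-selection idea concrete, here is a minimal sketch that ranks MMDiT blocks by a skip-sensitivity proxy (how much the denoiser output changes when a block is bypassed) and keeps only the most important ones. The importance measure and the `skip` flag are our own illustrative assumptions, used only to show how the quality-versus-size trade-off can be controlled.

```python
# Hedged sketch: rank MMDiT blocks by a skip-sensitivity proxy and keep the top-k.
import torch

@torch.no_grad()
def rank_blocks(denoiser, calib_batches, num_keep=18):
    scores = [0.0] * len(denoiser.blocks)
    for latents, cond in calib_batches:
        reference = denoiser(latents, cond)           # full 24-block forward pass
        for i, block in enumerate(denoiser.blocks):
            block.skip = True                         # bypass block i (hypothetical hook)
            pruned = denoiser(latents, cond)
            block.skip = False
            scores[i] += (pruned - reference).abs().mean().item()
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:num_keep]
    return sorted(keep)                               # indices of blocks to retain
```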

24-Blocks: Baseline
18-Blocks: Before Fine-Tuning
18-Blocks: Stage-1
18-Blocks: Stage-2

Video results for the qualitative evaluation of Stage-2 finetuning after Block Pruning. We show the qualitative difference made by applying Stage-1 finetuning followed by Stage-2 finetuning to the 18-block pruned model. As can be seen, the quality improves significantly after Stage-2 finetuning, enabling near-lossless model compression.

Step Distillation Qualitative Results

Pyramidal-Flow Baseline
Pyramidal Mean-Flows
Pyramidal DMD (Ours)
Pyramidal Progressive
Pyramidal Adversarial

These videos show the qualitative results of our Step Distillation experiments. Among all methods, Pyramidal DMD best preserves motion dynamics, though it introduces some color saturation and semantic artifacts. Pyramidal Mean-Flows also produces promising results, but achieves lower VBench scores than Pyramidal DMD, so we select DMD as our final approach. To address limitations in first-frame generation with Pyramidal DMD, we use SSD-1B as the first-frame generator.

Adobe Premiere Pro Plugin Demo

We present a demo of our Neodragon Mobile VDM as a plugin inside Adobe Premiere Pro. The plugin runs on a laptop with a Qualcomm Snapdragon X Elite SoC, which contains the same Hexagon NPU found in mobile devices powered by the Snapdragon 8 Gen 4 SoC. As shown in the Task Manager, the text-to-video model runs on the Hexagon NPU, and the entire video generation pipeline runs on device to generate the video clip. This plugin demonstrates how our Mobile VDM can be easily integrated into existing video editing software, thereby accelerating various creative workflows.

BibTeX

@article{karnewar2025neodragon,
    author  = {Animesh Karnewar and Denis Korzhenkov and Ioannis Lelekas and Noor Fathima and Adil Karjauv and Hanwen Xiong and Vancheeswaran Vaidyanathan and Will Zeng and Rafael Esteves and Tushar Singhal and Fatih Porikli and Mohsen Ghafoorian and Amirhossein Habibian},
    title   = {Neodragon: Mobile Video Generation using Diffusion Transformer},
    journal = {arXiv preprint arXiv:2511.06055},
    year    = {2025},
}