Abstract

We introduce Neodragon, a text-to-video system capable of generating 2s (49 frames @ 24 fps) videos at a resolution of \(\texttt{[640×1024]}\) directly on a Qualcomm Hexagon NPU in a record ~6.7s (7 FPS). Unlike existing transformer-based text-to-video models designed for offline generation, Neodragon is the first to be optimised specifically for mobile hardware, achieving efficient, low-cost, and high-fidelity video synthesis. Our main contributions are:

  • Replacing the original large 4.762B \(\mathit{T5}_\text{XXL}\) Text-Encoder with a much smaller 0.2B \(\mathit{DT5}\) (DistilT5) at minimal quality loss, enabling the entire model to run without CPU offloading. This is enabled by a novel Text-Encoder Distillation procedure which uses only generative text-prompt data and does not require any image or video data (see the sketch after this list).
  • Proposing an Asymmetric Decoder Distillation approach that allows us to replace the native codec-latent-VAE decoder with a more efficient one without disturbing the generative latent space of the video generation pipeline.
  • Pruning MMDiT blocks within the denoiser backbone based on their relative importance, and recovering the original performance through a two-stage distillation process.
  • Reducing the NFE (number of function evaluations) requirement of the denoiser by performing step distillation with a technique adapted from DMD for pyramidal flow-matching, thereby significantly accelerating video generation.
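
For illustration, here is a minimal sketch of the prompt-only text-encoder distillation idea from the first bullet. The callables `teacher`, `student`, `proj`, and `tokenize` are placeholders for the frozen T5-XXL encoder, the trainable DistilT5 student, a learned projection to the teacher's embedding width, and a shared tokenizer; the MSE objective is our own assumption, not necessarily the exact Neodragon recipe.

```python
# Hedged sketch: distill a small text encoder on prompts only (no image/video data).
import torch
import torch.nn.functional as F

def distill_step(teacher, student, proj, tokenize, prompts, optimizer):
    tokens = tokenize(prompts)             # (B, L) token ids for a batch of text prompts
    with torch.no_grad():
        t_emb = teacher(tokens)            # (B, L, d_teacher), teacher is frozen
    s_emb = student(tokens)                # (B, L, d_student), student is trainable
    loss = F.mse_loss(proj(s_emb), t_emb)  # match the teacher's token embeddings (assumed loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```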

When paired with an optimised SSD1B first-frame image generator and QuickSRNet for \(2\!\times\!\) super-resolution, our end-to-end Neodragon system is highly efficient in parameters (4.945B for the full model), memory (3.5GB peak RAM usage), and runtime (6.7s E2E latency), making it a mobile-friendly model that achieves a VBench total score of 81.61 and yields high-fidelity generated videos.
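
To make the overall composition concrete, the following sketch strings the stages together in the order described above; the function signatures and callables are illustrative placeholders under our own assumptions, not the released API.

```python
# Hedged sketch of the end-to-end flow: first-frame generation (SSD1B), few-step
# pyramidal denoising with the pruned MMDiT conditioned on DistilT5 embeddings,
# latent decoding with the distilled decoder, and 2x super-resolution (QuickSRNet).
import torch

@torch.no_grad()
def generate_video(prompt,
                   text_encoder,    # DistilT5 student (~0.2B parameters)
                   first_frame_gen, # SSD1B first-frame image generator
                   denoiser,        # pruned MMDiT, few NFEs after step distillation
                   decoder,         # distilled codec-latent decoder
                   upscaler):       # QuickSRNet 2x super-resolution
    text_emb = text_encoder(prompt)            # prompt conditioning
    first_frame = first_frame_gen(text_emb)    # 3 x 320 x 512 anchor frame (assumed base res)
    latents = denoiser(text_emb, first_frame)  # few-step pyramidal flow-matching sampling
    frames = decoder(latents)                  # 49 x 3 x 320 x 512 video at native resolution
    return upscaler(frames)                    # 49 x 3 x 640 x 1024 final clip
```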

By enabling low-cost, private, and on-device text-to-video synthesis, Neodragon democratizes AI-based video content creation, empowering creators to generate high-quality videos without reliance on cloud services. Code and model will be made publicly available at our website.

Qualitative Video Results

Please refer to the arXiv Technical Report for details about our proposed optimisations and the technical details of our End-to-End Mobile Video Generation pipeline. In the next few sections, we present the qualitative results of our proposed optimisations in video format (since the technical report only contains static images) in order to demonstrate the overall look and feel and the spatio-temporal consistency of the generated videos, and thereby the effectiveness of our proposed optimisations leading to the final Neodragon Mobile VDM.

Text-Encoder Distillation Qualitative Results

[RM]: Replace-Mode
[EM]: Extend-Mode
[LORA]: LoRA-Mode
[TDT5]: Trainable DistilT5

Video results for the qualitative evaluation of our proposed Text-Encoder Distillation approach. Note that the SSD-1B first-frame generator is not used for these results, and the generated videos are at the native (non-super-resolved) resolution of [49 x 320 x 512].

Asymmetric Decoder Distillation Qualitative Results

Pyramidal-Flow Native decoder
Our Modified TinyAEHV decoder
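
As a rough illustration of the Asymmetric Decoder Distillation compared here, the sketch below keeps the generative latent space untouched and trains only a compact decoder to reproduce the frozen native decoder's reconstructions. `native_decoder` and `tiny_decoder` stand for the Pyramidal-Flow decoder and our TinyAEHV-style student; the L1 objective is an assumed choice, not necessarily the one used in Neodragon.

```python
# Hedged sketch: distill a compact decoder against the frozen native codec-latent decoder.
import torch
import torch.nn.functional as F

def decoder_distill_step(native_decoder, tiny_decoder, latents, optimizer):
    with torch.no_grad():
        target = native_decoder(latents)  # reference frames from the frozen native decoder
    pred = tiny_decoder(latents)          # student decodes the *same* codec latents
    loss = F.l1_loss(pred, target)        # pixel-space matching leaves the latent space fixed
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```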

MMDiT Block Pruning Qualitative Results

24-Blocks: Baseline
22-Blocks
20-Blocks
18-Blocks
16-Blocks

Video results for the qualitative evaluation of Stage-1 finetuning after Block Pruning. These results show the effectiveness of our block-selection approach: although the technique is simple, it can be applied very effectively to control the trade-off between quality and model size for the MMDiT denoiser.
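
To make the block-selection idea concrete, here is a minimal sketch that ranks MMDiT blocks by a skip-sensitivity proxy (how much the denoiser output changes when a block is bypassed) and keeps only the most important ones. The importance measure and the `skip` flag are our own illustrative assumptions, used only to show how the quality-versus-size trade-off can be controlled.

```python
# Hedged sketch: rank MMDiT blocks by a skip-sensitivity proxy and keep the top-k.
import torch

@torch.no_grad()
def rank_blocks(denoiser, calib_batches, num_keep=18):
    scores = [0.0] * len(denoiser.blocks)
    for latents, cond in calib_batches:
        reference = denoiser(latents, cond)           # full 24-block forward pass
        for i, block in enumerate(denoiser.blocks):
            block.skip = True                         # bypass block i (hypothetical hook)
            pruned = denoiser(latents, cond)
            block.skip = False
            scores[i] += (pruned - reference).abs().mean().item()
    keep = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:num_keep]
    return sorted(keep)                               # indices of blocks to retain
```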

24-Blocks: Baseline
18-Blocks: Before Fine-Tuning
18-Blocks: Stage-1
18-Blocks: Stage-2

Video results for the qualitative evaluation of Stage-2 finetuning after Block Pruning. We show the qualitative difference made by applying Stage-1 finetuning followed by Stage-2 finetuning to the 18-block pruned model. As can be seen, the quality improves significantly after Stage-2 finetuning, enabling near-lossless model compression.

Step Distillation Qualitative Results

Pyramidal-Flow Baseline
Pyramidal Mean-Flows
Pyramidal DMD (Ours)
Pyramidal Progressive
Pyramidal Adversarial

These videos show the qualitative results of our Step Distillation experiments. Among all methods, Pyramidal DMD best preserves motion dynamics, though it introduces some color saturation and semantic artifacts. Pyramidal Mean-Flows also produces promising results, but achieves lower VBench scores than Pyramidal DMD, so we select DMD as our final approach. To address limitations in first-frame generation with Pyramidal DMD, we use SSD-1B as the first-frame generator.

Adobe Premiere Pro Plugin Demo

We present a demo of our Neodragon Mobile VDM as a plugin inside Adobe Premiere Pro. The plugin runs on a laptop with a Qualcomm Snapdragon X Elite SoC, which contains the same Hexagon NPU found in mobile devices powered by the Snapdragon 8 Gen 4 SoC. As shown in the Task Manager, the text-to-video model runs on the Hexagon NPU, and the entire video generation pipeline runs on device to generate the video clip. This plugin demonstrates how our Mobile VDM can be easily integrated into existing video editing software, thereby accelerating various creative workflows.

BibTeX

@article{karnewar2025neodragon,
    author  = {Animesh Karnewar and Denis Korzhenkov and Ioannis Lelekas and Noor Fathima and Adil Karjauv and Hanwen Xiong and Vancheeswaran Vaidyanathan and Will Zeng and Rafael Esteves and Tushar Singhal and Fatih Porikli and Mohsen Ghafoorian and Amirhossein Habibian},
    title   = {Neodragon: Mobile Video Generation using Diffusion Transformer},
    journal = {arXiv preprint arXiv:2511.06055},
    year    = {2025},
}