Abstract

Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. By optimizing the spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational requirements. We achieve this by lowering the resolution to 512 × 256 px, incorporating multi-scale temporal representations, and introducing two novel pruning schemes that reduce the number of channels and temporal blocks in the UNet. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14 × 512 × 256 px clip in 1.7 seconds on a Xiaomi 14 Pro.

Method

We start with Stable Video Diffusion (SVD), which generates 14 frames at 1024 × 576 resolution with 25 sampling steps. Through the following optimizations, we enhance SVD's efficiency for mobile deployment:

  1. Adversarial Finetuning: Reducing the number of denoising steps to a single step using adversarial finetuning, enabling faster video generation.
  2. Low Resolution Finetuning: Reducing the resolution to a mobile-friendly 512 × 256 by finetuning the diffusion denoiser at this resolution.
  3. Temporal Multi-scaling: Adding down-sampling and up-sampling operations along the temporal axis.
  4. Optimized Cross-Attention: Removing no-op computations from the cross-attention layers, enabling on-device model execution.
  5. Channel Funnels: Reducing the number of channels to save computations with minimal quality loss. A channel funnel is placed between two affine layers, reducing intermediate channel dimensionality.
  6. Temporal Block Pruning: A learnable pruning technique to remove less important temporal blocks while minimizing quality degradation.
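Two of the items above, temporal multi-scaling and channel funnels, can be illustrated with a small NumPy sketch. It is a toy under stated assumptions, not the paper's implementation. For temporal multi-scaling, the sketch assumes average pooling over pairs of frames for down-sampling and frame repetition for up-sampling; the actual operators may differ. For channel funnels, the source states only that a funnel sits between two affine layers and narrows the intermediate channel dimension; the sketch assumes the funnel is a pair of extra linear maps (`F1` and `F2`, hypothetical names) that can be folded into the neighbouring weights after training. All shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Temporal multi-scaling (item 3).
# Assumed operators: mean over pairs of frames, then frame repetition.
def temporal_downsample(x, factor=2):
    """Average groups of `factor` consecutive frames; x has shape (T, C, H, W)."""
    t = x.shape[0]
    assert t % factor == 0, "frame count must be divisible by the factor"
    return x.reshape(t // factor, factor, *x.shape[1:]).mean(axis=1)

def temporal_upsample(x, factor=2):
    """Repeat every frame `factor` times along the time axis."""
    return np.repeat(x, factor, axis=0)

video_features = rng.standard_normal((14, 4, 8, 16))  # 14 frames, toy feature map
coarse = temporal_downsample(video_features)          # 7 frames
restored = temporal_upsample(coarse)                  # back to 14 frames

# Channel funnels (item 5).
# Assumed form: two affine layers W1, W2 with funnel maps F1, F2 in between.
c_in, c_inner, c_out, c_funnel = 8, 32, 8, 16
W1 = rng.standard_normal((c_inner, c_in))      # first affine layer
W2 = rng.standard_normal((c_out, c_inner))     # second affine layer
F1 = rng.standard_normal((c_funnel, c_inner))  # funnel: narrows the channels
F2 = rng.standard_normal((c_inner, c_funnel))  # funnel: widens them again

def forward_with_funnels(x):
    """Training-time view: funnels are separate matrices between the layers."""
    return W2 @ (F2 @ (F1 @ (W1 @ x)))

# Inference-time view: fold each funnel into its neighbouring weight matrix.
W1_merged = F1 @ W1  # shape (c_funnel, c_in)
W2_merged = W2 @ F2  # shape (c_out, c_funnel)

def forward_merged(x):
    return W2_merged @ (W1_merged @ x)

def macs(*weights):
    """Multiply-accumulates of a chain of matrix-vector products."""
    return sum(w.shape[0] * w.shape[1] for w in weights)

x = rng.standard_normal(c_in)
original_macs = macs(W1, W2)
merged_macs = macs(W1_merged, W2_merged)

print("frames:", video_features.shape[0], "->", coarse.shape[0], "->", restored.shape[0])
print("funnel outputs match:", np.allclose(forward_with_funnels(x), forward_merged(x)))
print("MACs without funnel:", original_macs, "with merged funnel:", merged_macs)
```

In this toy setting, halving the intermediate width halves the multiply-accumulate count of the layer pair, and halving the frame count lets the inner temporal layers process half as many frames. How much quality is retained depends on finetuning, which the sketch does not model.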

Results

Model                 NFE    FVD     TFLOPs    GPU latency (ms)    Phone latency (ms)

Resolution 1024 × 576
SVD                   50     149     45.43     376                 OOM
AnimateLCM*           8      281     45.43     376                 OOM
UFOGen*               1      1917    45.43     376                 OOM
LADD*                 1      1894    45.43     376                 OOM
SF-V*                 1      181     45.43     376                 OOM
MobileVD-HD (ours)    1      184     23.63     227                 OOM

Resolution 512 × 256
SVD                   50     476     8.6       82                  OOM
MobileVD (ours)       1      171     4.34      45                  1780

FLOPs and latency are reported for a single function evaluation with a batch size of 1. For rows marked with an asterisk (*), FVD values are taken from Zhang et al., while performance metrics are based on our measurements of the UNet used by SVD. For consistency with these results, FVD for SVD and our MobileVD model was measured on the UCF-101 dataset at 7 frames per second. Phone latency is measured on a Xiaomi 14 Pro.
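Because the table reports cost per function evaluation, the cost of generating a full clip also depends on the NFE column. The short calculation below is a sketch that assumes total cost is simply NFE multiplied by the per-evaluation TFLOPs, ignoring everything outside the UNet. The row labels are shorthand introduced here; the numbers are copied from the table.

```python
# (NFE, TFLOPs per function evaluation), copied from the results table.
rows = {
    "SVD (1024 x 576)": (50, 45.43),
    "SF-V* (1024 x 576)": (1, 45.43),
    "MobileVD-HD (1024 x 576)": (1, 23.63),
    "SVD (512 x 256)": (50, 8.6),
    "MobileVD (512 x 256)": (1, 4.34),
}

def total_tflops(nfe, tflops_per_eval):
    """Assumed cost model: every function evaluation costs the same."""
    return nfe * tflops_per_eval

totals = {name: total_tflops(*values) for name, values in rows.items()}
baseline = totals["SVD (1024 x 576)"]

for name, total in totals.items():
    ratio = baseline / total
    print(f"{name:26s} {total:9.2f} TFLOPs   {ratio:6.1f}x cheaper than full-resolution SVD")
```

Under this assumption, full-resolution SVD costs about 2271.5 TFLOPs per clip and MobileVD costs 4.34 TFLOPs, a ratio of roughly 523, which is in line with the efficiency factor quoted in the abstract.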

[Video comparison panels: SVD, AnimateLCM, SF-V, MobileVD]

BibTeX

              
@article{benyahia2024mobilevd,
  author={Ben Yahia, Haitam and Korzhenkov, Denis and Lelekas, Ioannis and Ghodrati, Amir and Habibian, Amirhossein},
  title={Mobile Video Diffusion},
  journal={arXiv},
  year={2024},
  url={https://arxiv.org/abs/2412.07583}
}