Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from the spatio-temporal UNet of Stable Video Diffusion (SVD), we reduce memory and computational requirements by lowering the resolution to 512 x 256 px, incorporating multi-scale temporal representations, and introducing two novel pruning schemes that reduce the number of channels and temporal blocks in the UNet. Furthermore, we employ adversarial finetuning to reduce denoising to a single step. Our model, MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14 x 512 x 256 px clip in 1.7 seconds on a Xiaomi 14 Pro.
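We start with Stable Video Diffusion (SVD), which generates 14 frames at 1024 × 576 resolution with 25 sampling steps. We then improve SVD's efficiency for mobile deployment through a series of optimizations: lower resolution, multi-scale temporal representations, channel and temporal-block pruning, and adversarial finetuning down to a single denoising step.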
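To give a feel for what the resolution change alone buys, the sketch below counts the spatio-temporal latent positions the UNet has to process at each resolution. It assumes an 8x spatial downsampling factor in the autoencoder, as is common for Stable Diffusion-style latent models; that factor and the helper name `latent_positions` are illustrative assumptions, not details stated here.

```python
# Rough count of latent positions at the two resolutions named in the text.
# Assumption (not stated in the text): the autoencoder downsamples each
# spatial side by 8x before the UNet sees the data.

FRAMES = 14          # clip length given in the text
VAE_DOWNSAMPLE = 8   # assumed spatial downsampling factor


def latent_positions(width: int, height: int, frames: int = FRAMES) -> int:
    """Number of spatio-temporal latent positions for one clip."""
    return frames * (width // VAE_DOWNSAMPLE) * (height // VAE_DOWNSAMPLE)


svd_positions = latent_positions(1024, 576)    # 14 * 128 * 72
mobile_positions = latent_positions(512, 256)  # 14 * 64 * 32
reduction = svd_positions / mobile_positions

print(f"1024 x 576: {svd_positions} positions")
print(f"512 x 256:  {mobile_positions} positions")
print(f"reduction:  {reduction:.1f}x")
```

Under this assumption the lower resolution cuts the number of positions by 4.5x. The per-evaluation cost reported for SVD in the table falls by a somewhat larger factor (45.43 vs. 8.6 TFLOPs, about 5.3x), which would be consistent with some layers scaling faster than linearly in the number of positions.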
Model | NFE | FVD | TFLOPs | GPU latency (ms) | Phone latency (ms) |
---|---|---|---|---|---|
Resolution 1024 x 576 | | | | | |
SVD | 50 | 149 | 45.43 | 376 | OOM |
AnimateLCM* | 8 | 281 | 45.43 | 376 | OOM |
UFOGen* | 1 | 1917 | 45.43 | 376 | OOM |
LADD* | 1 | 1894 | 45.43 | 376 | OOM |
SF-V* | 1 | 181 | 45.43 | 376 | OOM |
MobileVD-HD (ours) | 1 | 184 | 23.63 | 227 | OOM |
Resolution 512 x 256 | | | | | |
SVD | 50 | 476 | 8.6 | 82 | OOM |
MobileVD (ours) | 1 | 171 | 4.34 | 45 | 1780 |
FLOPs and latency are reported for a single function evaluation with a batch size of 1. For rows marked with an asterisk (*), FVD values are taken from Zhang et al., while the performance metrics are based on our own measurements of the UNet used by SVD. For consistency with these results, FVD for SVD and for our MobileVD model was measured on the UCF-101 dataset at 7 frames per second. Phone latency is measured on a Xiaomi 14 Pro.
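Because the table reports cost per function evaluation, the total cost of generating one clip is the per-evaluation cost multiplied by the number of function evaluations (NFE). The short sketch below works through that arithmetic for the two rows behind the headline comparison. The helper `total_tflops` is a hypothetical name, and the calculation assumes every evaluation costs the same.

```python
# Values copied from the table: (NFE, TFLOPs per function evaluation).
SVD_1024 = (50, 45.43)      # SVD at 1024 x 576
MOBILEVD_512 = (1, 4.34)    # MobileVD at 512 x 256


def total_tflops(nfe: int, tflops_per_eval: float) -> float:
    """Total UNet compute for one clip, assuming equal cost per evaluation."""
    return nfe * tflops_per_eval


svd_total = total_tflops(*SVD_1024)
mobilevd_total = total_tflops(*MOBILEVD_512)
ratio = svd_total / mobilevd_total

print(f"SVD:      {svd_total:.1f} TFLOPs")
print(f"MobileVD: {mobilevd_total:.2f} TFLOPs")
print(f"ratio:    {ratio:.0f}x")
```

With the table values this gives a ratio of roughly 523x, matching the efficiency gain quoted in the abstract. The SVD total computed this way (about 2271.5 TFLOPs) differs from the 1817.2 TFLOPs figure quoted there, which may rest on an accounting that is not spelled out in this text. The same multiplication applied to the phone latency column (1 evaluation at 1780 ms) is in line with the roughly 1.7 seconds reported for MobileVD.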
@article{benyahia2024mobilevd,
author={Ben Yahia, Haitam and Korzhenkov, Denis and Lelekas, Ioannis and Ghodrati, Amir and Habibian, Amirhossein},
title={Mobile Video Diffusion},
journal={arXiv},
year={2024},
url={https://arxiv.org/abs/2412.07583}
}