Mobile Video Diffusion

Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. By optimizing the spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational requirements. We achieve this by lowering the resolution to 512 x 256 px, incorporating multi-scale temporal representations, and introducing two novel pruning schemes that reduce the number of channels and temporal blocks in the UNet. Furthermore, we employ adversarial finetuning to reduce denoising to a single step. Our model, MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14 x 512 x 256 px clip in 1.7 seconds on a Xiaomi 14 Pro.

We start with Stable Video Diffusion (SVD), which generates 14 frames at 1024 × 576 resolution with 25 sampling steps. Through the following optimizations, we enhance SVD's efficiency for mobile deployment:

Adversarial Finetuning: Reducing the number of denoising steps to a single step using adversarial finetuning, enabling faster video generation.
Low Resolution Finetuning: Reducing to the mobile resolution of 512 × 256 by finetuning the diffusion denoiser at this resolution.
Temporal Multi-scaling: Adding down-sampling and up-sampling operations along the temporal axis.
Optimized Cross-Attention: Optimizing the cross-attention layers by removing no-op computations, enabling on-device model execution.
Channel Funnels: Reducing the number of channels to save computations with minimal quality loss. A channel funnel is placed between two affine layers, reducing intermediate channel dimensionality.
Temporal Block Pruning: A learnable pruning technique to remove less important temporal blocks while minimizing quality degradation.
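
To make the channel funnel idea more concrete, here is a minimal sketch in plain Python. It assumes the two affine layers have no nonlinearity between them and omits biases. It also assumes the funnel consists of two small matrices (called `F1` and `F2` here) that can be folded into the neighbouring weights at inference. These names, the layer sizes, and the folding step are illustrative assumptions, not details stated above.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)

# Hypothetical sizes: the funnel halves the intermediate width.
c_in, c_inner, c_funnel, c_out = 8, 32, 16, 8

W1 = rand_matrix(c_inner, c_in, rng)      # first affine layer (bias omitted)
W2 = rand_matrix(c_out, c_inner, rng)     # second affine layer (bias omitted)
F1 = rand_matrix(c_funnel, c_inner, rng)  # funnel: project down to c_funnel
F2 = rand_matrix(c_inner, c_funnel, rng)  # funnel: project back up to c_inner

x = rand_matrix(c_in, 1, rng)  # one input column vector

# With the funnel inserted between the layers: y = W2 F2 F1 W1 x
y_with_funnel = matmul(W2, matmul(F2, matmul(F1, matmul(W1, x))))

# Assumed inference-time folding: absorb the funnel into both neighbours,
# so the intermediate dimension shrinks from c_inner to c_funnel.
W1_merged = matmul(F1, W1)  # shape: c_funnel x c_in
W2_merged = matmul(W2, F2)  # shape: c_out x c_funnel
y_merged = matmul(W2_merged, matmul(W1_merged, x))

max_abs_diff = max(abs(a[0] - b[0]) for a, b in zip(y_with_funnel, y_merged))

# Multiply-accumulate counts per input vector, before and after folding.
macs_before = c_inner * c_in + c_out * c_inner
macs_after = c_funnel * c_in + c_out * c_funnel

print(f"max |difference| between the two paths: {max_abs_diff:.2e}")
print(f"MACs per input vector: {macs_before} -> {macs_after}")
```

Under these assumptions, the folded pair of layers computes the same function as the funnelled pair, while the cost scales with the smaller funnel width rather than the original intermediate width.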

Model	NFE	FVD	TFLOPs	GPU latency (ms)	Phone latency (ms)
Resolution 1024 × 576
SVD	50	149	45.43	376	OOM
AnimateLCM*	8	281	45.43	376	OOM
UFOGen*	1	1917	45.43	376	OOM
LADD*	1	1894	45.43	376	OOM
SF-V*	1	181	45.43	376	OOM
MobileVD-HD (ours)	1	184	23.63	227	OOM
Resolution 512 × 256
SVD	50	476	8.6	82	OOM
MobileVD (ours)	1	171	4.34	45	1780

FLOPs and latency are reported for a single function evaluation with a batch size of 1; NFE is the number of function evaluations, and OOM indicates that the model ran out of memory. For rows marked with an asterisk (*), FVD measurements were taken from Zhang et al., while performance metrics are based on our measurements of the UNet used by SVD. For consistency with these results, FVD for SVD and our MobileVD model was measured on the UCF-101 dataset at 7 frames per second. Phone latency is measured on a Xiaomi 14 Pro.
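
Because the table reports cost per function evaluation, end-to-end cost depends on NFE as well. The short sketch below multiplies the two columns for a few rows. Treating total cost as the simple product NFE × TFLOPs per evaluation is an assumption of this sketch, and the row labels are made up for readability. Under that assumption, SVD at 1024 × 576 comes to about 2271.5 TFLOPs, roughly 523 times the single-step MobileVD cost.

```python
# (NFE, TFLOPs per function evaluation), copied from the table above.
rows = {
    "SVD @ 1024x576": (50, 45.43),
    "MobileVD-HD @ 1024x576": (1, 23.63),
    "SVD @ 512x256": (50, 8.6),
    "MobileVD @ 512x256": (1, 4.34),
}


def total_tflops(nfe, tflops_per_eval):
    """Assumed total cost: number of evaluations times cost per evaluation."""
    return nfe * tflops_per_eval


totals = {name: total_tflops(nfe, cost) for name, (nfe, cost) in rows.items()}
baseline = totals["SVD @ 1024x576"]

for name, total in totals.items():
    ratio = baseline / total
    print(f"{name}: {total:.2f} TFLOPs total, {ratio:.1f}x cheaper than the baseline")
```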

              
@article{benyahia2024mobilevd,
  author={Ben Yahia, Haitam and Korzhenkov, Denis and Lelekas, Ioannis and Ghodrati, Amir and Habibian, Amirhossein},
  title={Mobile Video Diffusion},
  journal={arXiv},
  year={2024},
  url={https://arxiv.org/abs/2412.07583}
}

Qualcomm AI Research

*Equal contribution

Results

Video comparison panels: SVD, AnimateLCM, SF-V, MobileVD
