12 FPS Editing on Mobile!

Abstract

Recent diffusion-based video editing techniques have shown remarkable potential and are increasingly used in practical applications. However, these methods remain prohibitively expensive and are particularly challenging to deploy on mobile devices. In this study, we introduce a series of optimizations that make mobile video editing feasible. Building on an existing image editing model, we first optimize its architecture and incorporate a lightweight autoencoder. We then extend classifier-free guidance distillation to multiple modalities, resulting in a threefold on-device speedup. Finally, we reduce the number of sampling steps to one with a novel adversarial distillation scheme that preserves the controllability of the editing process. Collectively, these optimizations enable video editing at 12 frames per second on mobile devices while maintaining high editing quality.

Method

In this work, we introduce several optimizations to accelerate diffusion-based video editing:

  1. Architectural Enhancement: We streamline the denoising UNet by removing the self-attention and cross-attention blocks at the highest resolution, where they are the most computationally expensive. We also replace the standard autoencoder with the lightweight “Tiny AutoEncoder for Stable Diffusion” (TAESD). Together, these changes yield the more efficient Mobile-Pix2Pix model; a minimal sketch of both changes is given after this list.
  2. Multimodal Guidance Distillation: Classifier-free guidance (CFG) over several modalities normally requires at least three forward passes per diffusion step (unconditional, image-conditioned, and fully conditioned). We distill this guidance over text and image inputs into a single forward pass per step, yielding a threefold increase in inference speed while maintaining quality and control. This is achieved by modifying the ResNet blocks to accept the guidance scales as embeddings (Figure 2); a sketch of the distillation objective follows this list.
  3. Adversarial Distillation: We reduce the number of diffusion steps to one through a novel adversarial distillation procedure that distills the multi-step model into a single-step model while preserving the controllability of edits. The student, initialized from the teacher, is trained against a discriminator that distinguishes teacher outputs from student outputs, pushing the single-step model towards high-quality, controlled edits (Figure 3); a sketch of the losses follows this list.
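
Below is a minimal sketch of the two architectural changes, assuming the publicly available InstructPix2Pix and TAESD checkpoints and the Hugging Face diffusers API; it is illustrative only, not the released MoViE code, and the step of copying the remaining pretrained weights into the modified UNet is omitted.

    import torch
    from diffusers import (AutoencoderTiny, StableDiffusionInstructPix2PixPipeline,
                           UNet2DConditionModel)

    # Start from the InstructPix2Pix image editing pipeline.
    pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
        "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
    )

    # (a) Replace the standard autoencoder with the Tiny AutoEncoder for Stable Diffusion,
    #     which makes latent encoding and decoding far cheaper.
    pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)

    # (b) Remove self-/cross-attention at the highest resolution by rebuilding the UNet
    #     with plain ResNet stages at the first down block and the last up block.
    cfg = dict(pipe.unet.config)
    cfg["down_block_types"] = ("DownBlock2D",) + tuple(cfg["down_block_types"][1:])
    cfg["up_block_types"] = tuple(cfg["up_block_types"][:-1]) + ("UpBlock2D",)
    mobile_unet = UNet2DConditionModel.from_config(cfg)  # pretrained weights copied separately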
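
The multimodal guidance distillation objective can be sketched as follows. Here teacher and student are assumed UNet-like callables and the function name and signatures are hypothetical; the student receives the two guidance scales that, in the actual model, are injected into its ResNet blocks as embeddings.

    import torch
    import torch.nn.functional as F

    def guidance_distillation_loss(teacher, student, x_t, t, text_emb, image_lat,
                                   null_text_emb, null_image_lat, s_txt, s_img):
        with torch.no_grad():
            # Teacher: three forward passes, combined with InstructPix2Pix-style CFG.
            eps_uncond = teacher(x_t, t, null_text_emb, null_image_lat)
            eps_img    = teacher(x_t, t, null_text_emb, image_lat)
            eps_full   = teacher(x_t, t, text_emb,      image_lat)
            eps_cfg = (eps_uncond
                       + s_img * (eps_img - eps_uncond)
                       + s_txt * (eps_full - eps_img))
        # Student: a single forward pass per step, conditioned on the guidance scales.
        eps_student = student(x_t, t, text_emb, image_lat, s_txt, s_img)
        return F.mse_loss(eps_student, eps_cfg)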
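
A sketch of the adversarial step-distillation losses is given below; the hinge GAN formulation, the conditioning passed to the discriminator, and the equal weighting of the terms are assumptions for illustration and may differ from the actual training setup.

    import torch.nn.functional as F

    def adversarial_distillation_losses(student, discriminator, x_t, t, cond, teacher_sample):
        # teacher_sample: a clean edit produced by the multi-step teacher ("real").
        # cond: the text/image conditioning, also given to the discriminator so that the
        # single-step student is rewarded for edits that match the instruction.
        fake = student(x_t, t, cond)  # single-step prediction by the student

        # Discriminator: hinge loss separating teacher outputs from student outputs.
        d_real = discriminator(teacher_sample, cond)
        d_fake = discriminator(fake.detach(), cond)
        d_loss = F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

        # Student (generator): fool the discriminator while staying close to the teacher.
        g_loss = -discriminator(fake, cond).mean() + F.mse_loss(fake, teacher_sample)
        return d_loss, g_loss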

These optimizations enable video editing at 12 frames per second on mobile devices, a significant step towards real-time text-guided video editing on mobile platforms.

Qualitative Results on Faces Dataset

[Video grid: four input face clips, each shown with three MoViE edits. Prompts: "In chinese ink style", "In caricature style", "In pop art style"; "Turn him into silver surfer", "Add wrinkles", "Add sunglasses"; "In pixar 3d style", "Turn him into vampire", "In pencil drawing style"; "Make him bronze", "Turn him into hulk", "In Minecraft style".]

Comparison to the Base Model on Faces Dataset

[Video grid: input clips edited by the base model and by MoViE with the prompts "In Monet style" and "Make him wooden".]

Comparison to SOTA Methods on DAVIS

[Video grid: input clips edited by InsV2V, Rerender-a-Video, TokenFlow, and MoViE with the prompts "Make it desert", "Turn the swan into flamingo", "Add grass", "Add snow", "Make him zombie", "Make him yeti", "Make her hair blonde", and "Add fire".]