Anti-I2V

Safeguarding your photos from malicious image-to-video generation

TL;DR Anti-I2V defends user images against unauthorized image-to-video generation via dual-space (L*a*b* + DCT) perturbations and layer-wise representation losses (IRA, IRC), achieving state-of-the-art protection across DiT, MM-DiT, and U-Net diffusion backbones.

Duc Vu·Anh Nguyen·Chi Tran·Anh Tran

Qualcomm AI Research

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.

CVPR 2026 (Main)

Identity (ISM)
0.616 → 0.353 (clean → Anti-I2V)
−43% identity preservation, averaged across all 3 backbones on CelebV-Text.
Video Quality (Q-Align)
0.793 → 0.378 (clean → Anti-I2V)
−52% generated-video quality, averaged across all 3 backbones.
Feature Sim (DINO)
0.754 → 0.534 (clean → Anti-I2V)
−29% DINO similarity to clean reference, averaged across all 3 backbones.
Generality
3 / 3
State-of-the-art on all three backbone families: CogVideoX-5B (MM-DiT), Open-Sora 1.2 (DiT), and DynamiCrafter (U-Net).
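
The headline numbers above can be reproduced from the benchmark table on this page. The short sketch below assumes each headline figure is the unweighted mean over the three backbones, and that the video-quality card uses the Q-A (V) column; the dictionary keys and helper names are illustrative only.

```python
# Per-backbone values copied from the benchmark table on this page,
# in the order CogVideoX-5B, DynamiCrafter, Open-Sora 1.2.
clean = {
    "ISM": [0.721, 0.528, 0.598],
    "Q-Align (video)": [0.802, 0.794, 0.782],
    "DINO": [0.828, 0.622, 0.811],
}
protected = {
    "ISM": [0.448, 0.151, 0.461],
    "Q-Align (video)": [0.532, 0.047, 0.554],
    "DINO": [0.722, 0.167, 0.713],
}


def mean(values):
    return sum(values) / len(values)


def summarize(metric):
    """Return (clean mean, protected mean, relative change in percent)."""
    c = mean(clean[metric])
    p = mean(protected[metric])
    return round(c, 3), round(p, 3), round(100 * (p - c) / c)


for metric in clean:
    print(metric, summarize(metric))
```

Under these assumptions the three tuples match the cards: 0.616 to 0.353 (−43%), 0.793 to 0.378 (−52%), and 0.754 to 0.534 (−29%).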

Abstract

Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Hence, their effectiveness against Diffusion Transformer (DiT) models remains largely under-explored, as these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the L*a*b* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process to design appropriate training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.

Key Contributions

Anti-I2V consists of three core components that jointly enable powerful, robust, and effective protection across diverse video diffusion backbones.

Dual-Space Perturbation diagram
Dual-Space Perturbation (DSP)

We propose a robust perturbation strategy that goes beyond the standard RGB space, which has been shown insufficient against modern video diffusion models. Anti-I2V's DSP operates in two complementary domains: (1) the L*a*b* color space, applying adversarial noise exclusively to the perceptually decorrelated a* and b* channels; and (2) the DCT frequency domain, targeting the most influential low-frequency components that encode fundamental structural and textural information. This dual-space approach yields perturbations that are simultaneously more effective at disrupting feature propagation, less perceptible spatially, and more resilient to common transformations such as blurring and JPEG compression.
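
As a rough illustration of where such a perturbation lives, the sketch below adds a small offset to the low-frequency DCT coefficients of the a* and b* channels only. It is not the authors' implementation: it assumes the image is already in L*a*b* (the RGB conversion is not shown), applies a single whole-image DCT rather than a block-wise one, and uses hypothetical names (`apply_dual_space_perturbation`, `keep`) and an arbitrary cutoff. Budget clipping and the optimization of `delta` are omitted.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat


def apply_dual_space_perturbation(lab, delta, keep):
    """Add `delta` to the lowest keep x keep DCT coefficients of a* and b*.

    lab   : float array (H, W, 3), assumed to be in L*a*b* already
    delta : float array (2, keep, keep), one coefficient patch per chroma channel
    keep  : hypothetical cutoff for what counts as "low frequency"
    """
    h, w, _ = lab.shape
    dh, dw = dct_matrix(h), dct_matrix(w)
    out = lab.copy()
    for c in (1, 2):  # a* and b*; channel 0 (L*) is left untouched
        coeffs = dh @ lab[:, :, c] @ dw.T      # forward 2-D DCT
        coeffs[:keep, :keep] += delta[c - 1]   # perturb only low frequencies
        out[:, :, c] = dh.T @ coeffs @ dw      # inverse 2-D DCT
    return out


# Toy usage on random data standing in for a small L*a*b* image.
rng = np.random.default_rng(0)
lab = rng.uniform(0.0, 1.0, size=(8, 8, 3))
delta = 0.05 * rng.standard_normal((2, 2, 2))
protected = apply_dual_space_perturbation(lab, delta, keep=2)
```

Because the DCT basis here is orthonormal, the change stays confined to the selected coefficients, and the lightness channel is returned unchanged.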

Internal Representation Anchor diagram
Internal Representation Anchor (IRA)

We propose the Internal Representation Anchor (IRA) loss, which minimizes the layer-wise Euclidean distance between hidden features produced when conditioning on the perturbed image versus an unrelated target image, across both the denoising network and the VAE. Unlike prior methods (e.g., AdvDM, MIST) that focus solely on the final output of the VAE, IRA disrupts feature extraction at every intermediate layer of all model components. This prevents the reconstruction of meaningful structure throughout the denoising process, amplifying the adversarial effects.

IRA Loss

Denoiser: $$\mathcal{L}_{\mathrm{IRA},\,\epsilon_\theta}^{\,m} \;=\; \mathbb{E}\,\bigl\lVert\,\epsilon_\theta^{\,m}(z_t,\,z_\xi,\,t,\,y) \;-\; \epsilon_\theta^{\,m}(z_t,\,z_\psi,\,t,\,y)\,\bigr\rVert_{2}^{2}$$
VAE: $$\mathcal{L}_{\mathrm{IRA},\,E}^{\,n} \;=\; \mathbb{E}\,\bigl\lVert\,E^{\,n}(z_\xi) \;-\; E^{\,n}(z_\psi)\,\bigr\rVert_{2}^{2}$$
Total: $$\mathcal{L}_{\mathrm{IRA}} \;=\; \mathcal{L}_{\mathrm{IRA},\,\epsilon_\theta} \;+\; \mathcal{L}_{\mathrm{IRA},\,E}$$
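
A minimal numerical sketch of the IRA terms, assuming the per-layer losses are simply summed over layers (the aggregation is not spelled out on this page) and that the hidden features have already been collected into one array per layer. The function name `ira_loss` and the toy arrays are illustrative; a real setup would compute this on framework tensors with gradients flowing back to the perturbation.

```python
import numpy as np


def ira_loss(feats_perturbed, feats_target):
    """Sum over layers of the batch-mean squared L2 distance.

    Each argument is a list with one array per layer, shaped (batch, dim).
    """
    total = 0.0
    for f_xi, f_psi in zip(feats_perturbed, feats_target):
        sq_dist = np.sum((f_xi - f_psi) ** 2, axis=1)  # squared L2 norm per sample
        total += float(np.mean(sq_dist))               # expectation over the batch
    return total


# Toy features: two denoiser layers and one VAE encoder layer.
denoiser_xi = [np.ones((2, 3)), np.zeros((2, 2))]
denoiser_psi = [np.zeros((2, 3)), np.zeros((2, 2))]
vae_xi = [np.full((2, 4), 0.5)]
vae_psi = [np.zeros((2, 4))]

# Total loss is the denoiser term plus the VAE term.
total = ira_loss(denoiser_xi, denoiser_psi) + ira_loss(vae_xi, vae_psi)
print(total)  # 4.0
```
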
Internal Representation Collapse diagram
Internal Representation Collapse (IRC)

Using PCA visualization of transformer block outputs during the denoising process, we identify that semantically rich features emerge in deeper layers (e.g., layer 27+ in CogVideoX, layer 19+ in Open-Sora), while early layers (e.g., layer 3) contain minimal semantic information. The IRC loss forces deep-layer feature maps to resemble those of an early, low-semantic layer, collapsing high-level semantic representations throughout the module. This collapse cascades through the attention mechanism, degrading semantic coherence and visual continuity in generated videos.

IRC Loss $$\mathcal{L}_{\mathrm{IRC}}^{\,i,j} \;=\; \mathbb{E}\,\bigl\lVert\,\epsilon_\theta^{\,j}(z_t,\,z_\xi,\,t,\,y) \;-\; \epsilon_\theta^{\,i}(z_t,\,z_\xi,\,t,\,y)\,\bigr\rVert_{2}^{2}$$
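
The IRC term can be sketched in the same style: take the output of one early block i as the reference and penalize the squared distance of each chosen deep block j to it. This is an illustrative sketch, not the released code. It assumes all block outputs share one shape, sums over the chosen deep layers, and uses hypothetical names (`irc_loss`, `early_idx`, `deep_indices`); whether the early-layer features are treated as a fixed target during optimization is not stated on this page.

```python
import numpy as np


def irc_loss(block_outputs, early_idx, deep_indices):
    """Sum over deep layers j of the mean squared L2 distance to early layer i.

    block_outputs : list of arrays, one per transformer block, shaped (tokens, dim)
    early_idx     : index i of the low-semantic reference block
    deep_indices  : indices j of the semantically rich blocks to collapse
    """
    early = block_outputs[early_idx]
    total = 0.0
    for j in deep_indices:
        sq_dist = np.sum((block_outputs[j] - early) ** 2, axis=-1)  # per token
        total += float(np.mean(sq_dist))
    return total


# Toy outputs: block k is filled with the constant k.
outputs = [np.full((4, 2), float(k)) for k in range(6)]
loss = irc_loss(outputs, early_idx=1, deep_indices=[4, 5])
print(loss)  # 50.0
```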

Benchmark Comparison

Anti-I2V achieves the strongest protection across all three video diffusion backbones, on both face-centric and full-body human-action benchmarks. Lower values indicate poorer generated-video quality and thus stronger protection.

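| Model (backbone) | Method | ISM ↓ | C-FIQA ↓ | Q-A (F) ↓ | Q-A (V) ↓ | DINO ↓ |
|---|---|---|---|---|---|---|
| CogVideoX-5B (MM-DiT) | Clean (no defense) | 0.721 | 0.522 | 0.746 | 0.802 | 0.828 |
| | AdvDM | 0.583 | 0.473 | 0.463 | 0.543 | 0.748 |
| | MIST | 0.561 | 0.463 | 0.476 | 0.577 | 0.750 |
| | VGMShield | 0.554 | 0.461 | 0.464 | 0.557 | 0.745 |
| | Anti-I2V (Ours) | 0.448 | 0.433 | 0.447 | 0.532 | 0.722 |
| DynamiCrafter (U-Net) | Clean (no defense) | 0.528 | 0.467 | 0.724 | 0.794 | 0.622 |
| | AdvDM | 0.269 | 0.370 | 0.167 | 0.207 | 0.397 |
| | MIST | 0.262 | 0.379 | 0.232 | 0.269 | 0.386 |
| | VGMShield | 0.286 | 0.431 | 0.243 | 0.289 | 0.401 |
| | Anti-I2V (Ours) | 0.151 | 0.303 | 0.032 | 0.047 | 0.167 |
| Open-Sora 1.2 (DiT) | Clean (no defense) | 0.598 | 0.508 | 0.712 | 0.782 | 0.811 |
| | AdvDM | 0.506 | 0.478 | 0.496 | 0.561 | 0.725 |
| | MIST | 0.493 | 0.475 | 0.497 | 0.594 | 0.710 |
| | VGMShield | 0.500 | 0.476 | 0.497 | 0.578 | 0.716 |
| | Anti-I2V (Ours) | 0.461 | 0.453 | 0.478 | 0.554 | 0.713 |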
Best per metric in blue bold. Clean rows show the unprotected baseline. All methods use a perturbation budget of 16/255 with 200 optimization iterations on a single A100 GPU.
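
To make the budget and iteration count concrete, the sketch below shows a generic projected signed-gradient loop. It is only an illustration under stated assumptions: the page does not say which norm the 16/255 budget refers to or what step size is used, so an L∞ ball in pixel space and a step of 1/255 are assumed here, and `protect` and `grad_fn` are hypothetical names. The toy gradient stands in for the gradient of the protection losses described above.

```python
import numpy as np


def protect(image, grad_fn, eps=16 / 255, steps=200, step_size=1 / 255):
    """Signed-gradient descent on a loss, projected onto an L-inf ball.

    image   : float array with values in [0, 1]
    grad_fn : callable returning the loss gradient w.r.t. its input
    """
    delta = np.zeros_like(image)
    for _ in range(steps):
        grad = grad_fn(image + delta)
        delta = delta - step_size * np.sign(grad)          # descend on the loss
        delta = np.clip(delta, -eps, eps)                  # stay within the budget
        delta = np.clip(image + delta, 0.0, 1.0) - image   # keep a valid image
    return image + delta


# Toy objective 0.5 * ||x - target||^2, whose gradient is x - target.
image = np.full((4, 4, 3), 0.5)
target = np.ones_like(image)
out = protect(image, lambda x: x - target)
```

In this toy case every pixel moves toward the target until it reaches the edge of the budget and then stays there for the remaining iterations.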

Demo Gallery

Anti-I2V provides robust protection across diverse image-to-video generation models, degrading output quality while preserving the imperceptibility of the applied perturbations.

CogVideoX-5B
DynamiCrafter
Open-Sora 1.2

Citation

BibTeX
@InProceedings{Vu_2026_CVPR,
  author    = {Vu, Duc and Nguyen, Anh and Tran, Chi and Tran, Anh},
  title     = {Anti-I2V: Safeguarding your Photos from Malicious Image-to-video Generation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2026},
  pages     = {37621-37631}
}