Anti-I2V: Safeguarding your photos from malicious image-to-video generation

Abstract

Advances in diffusion-based video generation models have significantly improved human animation, but they also pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, while relatively few explicitly address image-to-video diffusion models (VDMs), and most primarily focus on UNet-based architectures. Their effectiveness against Diffusion Transformer (DiT) models therefore remains largely under-explored, even though these models demonstrate improved feature retention and stronger temporal consistency due to their larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the L*a*b* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process and use them to design training objectives that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.

Method

Key Contributions

Anti-I2V consists of three core components that jointly enable powerful, robust, and effective protection across diverse video diffusion backbones.

Dual-Space Perturbation (DSP)

We propose a robust perturbation strategy that goes beyond the standard RGB space, which has been shown to be insufficient against modern video diffusion models. Anti-I2V's DSP operates in two complementary domains: (1) the L*a*b* color space, applying adversarial noise exclusively to the perceptually decorrelated a* and b* channels; and (2) the DCT frequency domain, targeting the most influential low-frequency components that encode fundamental structural and textural information. This dual-space approach yields perturbations that are simultaneously more effective at disrupting feature propagation, less perceptible spatially, and more resilient to common transformations such as blurring and JPEG compression.
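As a rough illustration of the two constraints described above, the sketch below (plain Python, no dependencies) shifts a single pixel only along a* and b*, and restricts a perturbation of an 8×8 block to its low-frequency DCT coefficients. The function names, the sRGB/D65 conversion constants, the 8×8 block size, and the `u + v < cutoff` mask are illustrative assumptions, not details taken from Anti-I2V, and the two steps are shown separately rather than as the method's actual pipeline.

```python
import math

WHITE = (0.95047, 1.0, 1.08883)  # D65 reference white (assumed)
D = 6 / 29

def _lin(c):    # sRGB value in [0, 1] -> linear light
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

def _gamma(c):  # linear light -> sRGB value
    return 12.92 * c if c <= 0.0031308 else 1.055 * c ** (1 / 2.4) - 0.055

def _f(t):
    return t ** (1 / 3) if t > D ** 3 else t / (3 * D * D) + 4 / 29

def _f_inv(t):
    return t ** 3 if t > D else 3 * D * D * (t - 4 / 29)

def rgb_to_lab(rgb):
    r, g, b = (_lin(c) for c in rgb)
    xyz = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b,
           0.2126729 * r + 0.7151522 * g + 0.0721750 * b,
           0.0193339 * r + 0.1191920 * g + 0.9503041 * b)
    fx, fy, fz = (_f(v / w) for v, w in zip(xyz, WHITE))
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))

def lab_to_rgb(lab):
    L, a, b = lab
    fy = (L + 16) / 116
    fs = (fy + a / 500, fy, fy - b / 200)
    x, y, z = (w * _f_inv(f) for f, w in zip(fs, WHITE))
    lin = (3.2404542 * x - 1.5371385 * y - 0.4985314 * z,
           -0.9692660 * x + 1.8760108 * y + 0.0415560 * z,
           0.0556434 * x - 0.2040259 * y + 1.0572252 * z)
    return tuple(min(1.0, max(0.0, _gamma(c))) for c in lin)

def perturb_ab(lab, da, db):
    """Shift only the chroma channels; lightness L* is left untouched."""
    L, a, b = lab
    return (L, a + da, b + db)

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th frequency."""
    return [[math.sqrt((1 if k == 0 else 2) / n)
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def dct2(block):
    C = dct_matrix(len(block))
    return matmul(matmul(C, block), transpose(C))

def idct2(coeffs):
    C = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(C), coeffs), C)

def low_freq_perturb(block, delta, cutoff=3):
    """Add delta only to DCT coefficients with u + v < cutoff, then invert."""
    coeffs = dct2(block)
    n = len(coeffs)
    for u in range(n):
        for v in range(n):
            if u + v < cutoff:
                coeffs[u][v] += delta[u][v]
    return idct2(coeffs)

# Chroma-only perturbation of one pixel.
pixel = (0.5, 0.4, 0.3)
lab = rgb_to_lab(pixel)
shifted = lab_to_rgb(perturb_ab(lab, da=2.0, db=-2.0))

# Low-frequency perturbation of one 8x8 block (toy gradient image).
block = [[(8 * i + j) / 64 for j in range(8)] for i in range(8)]
delta = [[0.05] * 8 for _ in range(8)]
perturbed = low_freq_perturb(block, delta, cutoff=3)
```

In this toy setup the shifted pixel keeps its original lightness, and the difference between `perturbed` and `block` has no energy outside the masked low-frequency coefficients. A real implementation would operate on whole image tensors and optimize the perturbation against the model rather than using fixed offsets.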

Internal Representation Anchor (IRA)

We propose the Internal Representation Anchor (IRA) loss, which minimizes the layer-wise Euclidean distance between hidden features produced when conditioning on the perturbed image versus an unrelated target image, across both the denoising network and the VAE. Unlike prior methods (e.g., AdvDM, MIST) that focus solely on the final output of the VAE, IRA disrupts feature extraction at every intermediate layer of all model components. This prevents the reconstruction of meaningful structure throughout the denoising process, amplifying the adversarial effects.

IRA Loss

Denoiser $$\mathcal{L}_{\mathrm{IRA},\,\epsilon_\theta}^{\,m} \;=\; \mathbb{E}\,\bigl\lVert\,\epsilon_\theta^{\,m}(z_t,\,z_\xi,\,t,\,y) \;-\; \epsilon_\theta^{\,m}(z_t,\,z_\psi,\,t,\,y)\,\bigr\rVert_{2}^{2}$$

VAE $$\mathcal{L}_{\mathrm{IRA},\,E}^{\,n} \;=\; \mathbb{E}\,\bigl\lVert\,E^{\,n}(z_\xi) \;-\; E^{\,n}(z_\psi)\,\bigr\rVert_{2}^{2}$$

Total $$\mathcal{L}_{\mathrm{IRA}} \;=\; \mathcal{L}_{\mathrm{IRA},\,\epsilon_\theta} \;+\; \mathcal{L}_{\mathrm{IRA},\,E}$$
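The three expressions above can be read as the following toy computation. This is a minimal sketch in plain Python: features are flat lists of floats, the expectation is replaced by a single sample, and summing the per-layer terms over the hooked layers is an assumption, since the text does not spell out the aggregation. `sq_l2` and `ira_loss` are hypothetical names.

```python
def sq_l2(u, v):
    """Squared Euclidean distance between two equally sized feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v))

def ira_loss(denoiser_pert, denoiser_tgt, vae_pert, vae_tgt):
    """Layer-wise distance between perturbed-image and target-image features.

    Each argument is a list holding one feature vector per hooked layer.
    Summation over layers is an assumption of this sketch.
    """
    denoiser_term = sum(sq_l2(p, t) for p, t in zip(denoiser_pert, denoiser_tgt))
    vae_term = sum(sq_l2(p, t) for p, t in zip(vae_pert, vae_tgt))
    return denoiser_term + vae_term

# Toy features: two denoiser layers and one VAE encoder layer.
den_pert = [[1.0, 2.0], [0.0, 1.0]]
den_tgt = [[1.0, 0.0], [0.0, 0.0]]
vae_pert = [[0.5, 0.5]]
vae_tgt = [[0.0, 0.5]]

loss = ira_loss(den_pert, den_tgt, vae_pert, vae_tgt)
print(loss)  # 5.25
```

Minimizing this quantity with respect to the perturbation pulls the features of the protected image toward those of the unrelated target at every hooked layer, not only at the VAE output.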

Figure: Internal Representation Collapse diagram.

Internal Representation Collapse (IRC)

Using PCA visualization of transformer block outputs during the denoising process, we identify that semantically rich features emerge in deeper layers (e.g., layer 27+ in CogVideoX, layer 19+ in Open-Sora), while early layers (e.g., layer 3) contain minimal semantic information. The IRC loss forces deep-layer feature maps to resemble those of an early, low-semantic layer, collapsing high-level semantic representations throughout the module. This effect cascades through the attention mechanism, degrading semantic coherence and visual continuity in generated videos.

IRC Loss $$\mathcal{L}_{\mathrm{IRC}}^{\,i,j} \;=\; \mathbb{E}\,\bigl\lVert\,\epsilon_\theta^{\,j}(z_t,\,z_\xi,\,t,\,y) \;-\; \epsilon_\theta^{\,i}(z_t,\,z_\xi,\,t,\,y)\,\bigr\rVert_{2}^{2}$$
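A toy version of the IRC term is sketched below, assuming the hidden states of all transformer blocks from one forward pass are available as a list of equally sized flat vectors. The layer indices are placeholders for a tiny six-block example, not the CogVideoX or Open-Sora settings, and summing over several deep layers is an assumption. Whether the shallow-layer features are treated as a constant during optimization is not stated in the text, so the sketch takes no position on it. `irc_loss` is a hypothetical name.

```python
def irc_loss(block_outputs, shallow, deep_layers):
    """Pull each deep layer's features toward the shallow layer's features.

    block_outputs: list of feature vectors, one per transformer block.
    shallow:       index i of the early, low-semantic layer.
    deep_layers:   indices j of the semantically rich layers.
    """
    anchor = block_outputs[shallow]
    total = 0.0
    for j in deep_layers:
        total += sum((d - s) ** 2 for d, s in zip(block_outputs[j], anchor))
    return total

# Six toy "blocks"; block k outputs the vector [k, 2k].
outputs = [[float(k), 2.0 * k] for k in range(6)]
loss = irc_loss(outputs, shallow=1, deep_layers=[4, 5])
print(loss)  # 125.0
```

The loss is zero only when every selected deep layer reproduces the shallow layer's features, which is the collapse the objective is designed to encourage.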

Quantitative Results

Benchmark Comparison

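Anti-I2V achieves the strongest overall protection across all three video diffusion backbones, on both face-centric and full-body human-action benchmarks. Lower values indicate poorer generated-video quality and thus stronger protection. Anti-I2V attains the lowest score on every metric in the table below except DINO in the first Open-Sora 1.2 block, where MIST is marginally lower (0.710 vs. 0.713).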

Model	Method	ISM ↓	C-FIQA ↓	Q-A (F) ↓	Q-A (V) ↓	DINO ↓
CogVideoX-5B (MM-DiT)	Clean (no defense)	0.721	0.522	0.746	0.802	0.828
	AdvDM	0.583	0.473	0.463	0.543	0.748
	MIST	0.561	0.463	0.476	0.577	0.750
	VGMShield	0.554	0.461	0.464	0.557	0.745
	Anti-I2V (Ours)	0.448	0.433	0.447	0.532	0.722
DynamiCrafter (U-Net)	Clean (no defense)	0.528	0.467	0.724	0.794	0.622
	AdvDM	0.269	0.370	0.167	0.207	0.397
	MIST	0.262	0.379	0.232	0.269	0.386
	VGMShield	0.286	0.431	0.243	0.289	0.401
	Anti-I2V (Ours)	0.151	0.303	0.032	0.047	0.167
Open-Sora 1.2 (DiT)	Clean (no defense)	0.598	0.508	0.712	0.782	0.811
	AdvDM	0.506	0.478	0.496	0.561	0.725
	MIST	0.493	0.475	0.497	0.594	0.710
	VGMShield	0.500	0.476	0.497	0.578	0.716
	Anti-I2V (Ours)	0.461	0.453	0.478	0.554	0.713
CogVideoX-5B (MM-DiT)	Clean (no defense)	0.466	0.373	0.361	0.436	0.801
	AdvDM	0.370	0.292	0.271	0.342	0.753
	MIST	0.355	0.290	0.262	0.340	0.751
	VGMShield	0.361	0.292	0.265	0.343	0.753
	Anti-I2V (Ours)	0.346	0.283	0.251	0.323	0.734
DynamiCrafter (U-Net)	Clean (no defense)	0.384	0.345	0.501	0.562	0.709
	AdvDM	0.110	0.335	0.162	0.201	0.451
	MIST	0.100	0.335	0.322	0.381	0.493
	VGMShield	0.108	0.336	0.318	0.374	0.486
	Anti-I2V (Ours)	0.068	0.268	0.057	0.084	0.164
Open-Sora 1.2 (DiT)	Clean (no defense)	0.400	0.382	0.409	0.437	0.750
	AdvDM	0.346	0.309	0.327	0.362	0.686
	MIST	0.339	0.309	0.338	0.392	0.677
	VGMShield	0.341	0.312	0.335	0.369	0.680
	Anti-I2V (Ours)	0.318	0.248	0.311	0.347	0.642

Best per metric in blue bold. Clean rows show the unprotected baseline. All methods use a perturbation budget of 16/255 with 200 optimization iterations on a single A100 GPU.
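The text gives the budget (16/255) and the iteration count (200) but not the update rule. The sketch below assumes an L∞ budget and a PGD-style signed-gradient step, with a toy quadratic loss standing in for the real objective. The step size, the function name `pgd_minimize`, and the loss itself are illustrative assumptions.

```python
EPS = 16 / 255    # perturbation budget from the text (L-infinity norm assumed)
STEPS = 200       # optimization iterations from the text
ALPHA = 1 / 255   # step size (assumed)

def sign(v):
    return (v > 0) - (v < 0)

def pgd_minimize(image, target, eps=EPS, steps=STEPS, alpha=ALPHA):
    """Minimize ||(image + delta) - target||^2 subject to |delta_i| <= eps.

    The quadratic loss is a stand-in for the real objective; its gradient
    with respect to delta is 2 * (image + delta - target).
    """
    delta = [0.0] * len(image)
    for _ in range(steps):
        grad = [2 * (x + d - t) for x, d, t in zip(image, delta, target)]
        # Signed-gradient descent step.
        delta = [d - alpha * sign(g) for d, g in zip(delta, grad)]
        # Project onto the eps-ball, then keep pixel values inside [0, 1].
        delta = [max(-eps, min(eps, d)) for d in delta]
        delta = [min(1.0, max(0.0, x + d)) - x for x, d in zip(image, delta)]
    return [x + d for x, d in zip(image, delta)]

image = [0.20, 0.50, 0.95]
target = [0.80, 0.10, 1.00]
protected = pgd_minimize(image, target)
```

Each protected pixel stays within 16/255 of its original value, so the update can never exceed the stated budget regardless of how far away the target is.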

Results

Demo Gallery

Anti-I2V provides robust protection across diverse image-to-video generation models, degrading output quality while preserving the imperceptibility of the applied perturbations.

CogVideoX-5B

Clean Anti-I2V

Prompt "A static medium shot of a doctor in a formal lab setting. He wears a light-colored collared shirt and keep their hands clasped in front of them. A lamp and globe are visible in the background. The camera remains fixed, with no changes in posture, framing, or environment throughout the video."

Clean Anti-I2V

Prompt "A woman with short dark hair seated in a dimly lit room. She wears a dark top and sit relaxed in a chair, with one arm resting on the armrest and the other extended along the chair. The background is softly blurred, keeping focus on the person."

Clean Anti-I2V

Prompt "A video of a woman with short curly hair seated indoors, speaking as if in a conversation. The background shows a neutral room with a bookshelf and a framed picture on the wall. The camera remains fixed, focused from the waist up. The woman mostly sits still with hands on their lap, then raises their hands briefly to gesture while speaking."

Clean Anti-I2V

Prompt "A close-up of a person with curly hair, wearing a pink headband and a white top, against a plain light-colored background. At first, their eyes are closed and their mouth is slightly open. Then their expression shifts to surprise, with eyes wide open and mouth agape."

DynamiCrafter

Clean Anti-I2V

Prompt "A middle-aged man seated in a chair, wearing a purple shirt. They face the camera directly with a relaxed posture and hands resting on the armrests, as if in a conversation. The background is dark and simple, keeping attention on the person."

Clean Anti-I2V

Prompt "A man with long curly hair seated in a studio, wearing a dark top. They sit in front of a microphone as if recording or speaking on a podcast. A plain backdrop and microphone stand are visible, with a plant on the right. The person keeps their hands on their lap and maintains a steady posture. The camera remains fixed with no major scene changes."

Clean Anti-I2V

Prompt "A static frontal medium shot of a man with brown beard giving a speech behind a fish tank outdoors. They wear a light-colored shirt and a dark hat, gesturing with their left hand to the left while speaking. A clear blue sky and calm body of water are visible in the background."

Clean Anti-I2V

Prompt "A woman with long dark hair seated against a plain light-colored wall, wearing a gray top. They gesture expressively while speaking, first raising one hand, then both hands with palms upward. Their expression becomes animated, with wide eyes and an open mouth, as if reacting or emphasizing a point. The background stays unchanged and the camera remains fixed."

Open-Sora 1.2

Clean Anti-I2V

Prompt "A man sits quietly in a dimly lit room washed in red and purple tones. They wear a black top, and their hair partly hides their eyes, giving the scene a mysterious feeling."

Clean Anti-I2V

Prompt "A 30-year-old woman sits on a black couch in a fixed medium shot, wearing a dark top layered with a patterned garment. She gestures expressively with both arms, alternately raising and lowering her hands with her palms facing upward or downward."

Clean Anti-I2V

Prompt "A static medium close-up of a soldier wearing a dark patterned shirt, with the background softly blurred. He begins with their hands clasped in front of them, then slowly open and raise both hands, with the right hand slightly higher than the left, as if emphasizing a point."

Clean Anti-I2V

Prompt "A formally dressed speaker stands behind a wooden desk, wearing a dark suit, tie, and white shirt. A laptop and microphone on the desk create a polished professional setting. The speaker remains composed with hands clasped in front, as if delivering a serious presentation or pausing for emphasis. The camera stays still, with no changes to the framing, lighting, or background."
