Safeguarding your photos from malicious image-to-video generation
TL;DR Anti-I2V defends user images against unauthorized image-to-video generation via dual-space (L*a*b* + DCT) perturbations and layer-wise representation losses (IRA, IRC), achieving state-of-the-art protection across DiT, MM-DiT, and U-Net diffusion backbones.
Qualcomm AI Research
Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc.
CVPR 2026 (Main)
Advances in diffusion-based video generation models, while significantly improving human animation, pose threats of misuse through the creation of fake videos from a specific person's photo and text prompts. Recent efforts have focused on adversarial attacks that introduce crafted perturbations to protect images from diffusion-based models. However, most existing approaches target image generation, relatively few explicitly address image-to-video diffusion models (VDMs), and most of those focus on UNet-based architectures. Their effectiveness against Diffusion Transformer (DiT) models therefore remains largely under-explored, even though these models demonstrate improved feature retention and stronger temporal consistency due to larger capacity and advanced attention mechanisms. In this work, we introduce Anti-I2V, a novel defense against malicious human image-to-video generation, applicable across diverse diffusion backbones. Instead of restricting noise updates to the RGB space, Anti-I2V operates in both the L*a*b* and frequency domains, improving robustness and concentrating on salient pixels. We then identify the network layers that capture the most distinct semantic features during the denoising process and design training objectives around them that maximize degradation of temporal coherence and generation fidelity. Through extensive validation, Anti-I2V demonstrates state-of-the-art defense performance against diverse video diffusion models, offering an effective solution to the problem.
Anti-I2V consists of three core components that jointly enable powerful, robust, and effective protection across diverse video diffusion backbones.
We propose a robust perturbation strategy that goes beyond the standard RGB space, which has been shown to be insufficient against modern video diffusion models. Anti-I2V's dual-space perturbation (DSP) operates in two complementary domains: (1) the L*a*b* color space, applying adversarial noise exclusively to the perceptually decorrelated a* and b* channels; and (2) the DCT frequency domain, targeting the most influential low-frequency components that encode fundamental structural and textural information. This dual-space approach yields perturbations that are simultaneously more effective at disrupting feature propagation, less perceptible spatially, and more resilient to common transformations such as blurring and JPEG compression.
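As a rough illustration of the dual-space idea, the sketch below assumes the image has already been converted to L*a*b* (the conversion itself is omitted) and parameterizes the perturbation by DCT coefficients, of which only a low-frequency block is kept. The function names, the `keep` cutoff, and the random coefficients are illustrative assumptions rather than the authors' implementation; in an actual attack the coefficients would be optimized against the model instead of sampled.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix C, so that C @ x is the DCT of x."""
    k = np.arange(n)[:, None]            # frequency index (rows)
    i = np.arange(n)[None, :]            # sample index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)           # the DC row uses a different normalisation
    return c


def low_freq_delta(coeffs: np.ndarray, keep: int) -> np.ndarray:
    """Map (H, W) DCT coefficients to a spatial perturbation that contains
    only the lowest keep x keep frequencies."""
    h, w = coeffs.shape
    mask = np.zeros((h, w))
    mask[:keep, :keep] = 1.0             # low frequencies sit in the top-left corner
    ch, cw = dct_matrix(h), dct_matrix(w)
    return ch.T @ (coeffs * mask) @ cw   # inverse 2-D DCT, valid because C is orthonormal


def perturb_lab(lab: np.ndarray, coeffs_ab: np.ndarray, keep: int) -> np.ndarray:
    """Add a low-frequency perturbation to the a* and b* channels only.

    lab:       (H, W, 3) image in L*a*b* (channel 0 = L*, 1 = a*, 2 = b*).
    coeffs_ab: (2, H, W) DCT-domain variables, one plane per chroma channel.
    """
    out = lab.astype(np.float64).copy()
    for c in range(2):
        out[..., c + 1] += low_freq_delta(coeffs_ab[c], keep)  # L* is never touched
    return out


# Toy usage with random stand-ins (assumed values, not real data).
rng = np.random.default_rng(0)
lab = rng.uniform(0.0, 1.0, size=(16, 16, 3))
coeffs = 0.01 * rng.normal(size=(2, 16, 16))
protected = perturb_lab(lab, coeffs, keep=4)
```

Parameterizing the perturbation directly in the DCT domain means the low-frequency constraint holds by construction, so no separate filtering step is needed after each update.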
We propose the Internal Representation Anchor (IRA) loss, which minimizes the layer-wise Euclidean distance between hidden features produced when conditioning on the perturbed image versus an unrelated target image, across both the denoising network and the VAE. Unlike prior methods (e.g., AdvDM, MIST) that focus solely on the final output of the VAE, IRA disrupts feature extraction at every intermediate layer of all model components. This prevents the reconstruction of meaningful structure throughout the denoising process, amplifying the adversarial effects.
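The layer-wise distance described above might be computed along the following lines. This is a minimal sketch that assumes the hidden features of every layer (VAE and denoiser) have already been collected into two lists; the function name `ira_loss` and the equal-weight average over layers are assumptions, since the exact weighting is not given here.

```python
import numpy as np


def ira_loss(feats_perturbed, feats_target):
    """Average over layers of the Euclidean distance between paired hidden features.

    Each argument is a sequence with one array per layer; the arrays at the
    same position must share a shape. One list comes from conditioning on the
    perturbed image, the other from conditioning on the unrelated target image.
    """
    if len(feats_perturbed) != len(feats_target):
        raise ValueError("both feature lists must cover the same layers")
    dists = [
        np.linalg.norm(np.asarray(p) - np.asarray(t))  # 2-norm of the flattened difference
        for p, t in zip(feats_perturbed, feats_target)
    ]
    return float(np.mean(dists))


# Toy usage: two "layers" with different shapes (hypothetical values).
perturbed = [np.zeros(2), np.zeros((2, 2))]
target = [np.array([3.0, 4.0]), np.ones((2, 2))]
loss = ira_loss(perturbed, target)
```

The perturbation would be updated to drive this value down, pulling the internal features of the protected image toward those of the target at every layer rather than only at the VAE output.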
Using PCA visualization of transformer block outputs during the denoising process, we identify that semantically rich features emerge in deeper layers (e.g., layer 27+ in CogVideoX, layer 19+ in OpenSora), while early layers (e.g., layer 3) contain minimal semantic information. The IRC loss forces deep-layer feature maps to resemble those of the early low-semantic layer, collapsing high-level semantic representations throughout the module. This cascades through the attention mechanism, degrading semantic coherence and visual continuity in generated videos.
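One way to express the "deep layers should resemble the shallow layer" objective is sketched below. It assumes every transformer block returns a feature map of the same shape, uses a mean squared gap as the distance, and takes the layer indices as parameters; the name `irc_loss`, the distance choice, and the indexing convention are all assumptions made for illustration.

```python
import numpy as np


def irc_loss(block_outputs, shallow_idx: int, deep_start: int) -> float:
    """Mean squared gap between each deep block output and one shallow block output.

    block_outputs: sequence of same-shaped arrays, one per transformer block.
    shallow_idx:   index of the low-semantic early block used as the anchor.
    deep_start:    index of the first block treated as semantically rich.
    """
    anchor = np.asarray(block_outputs[shallow_idx])
    deep = block_outputs[deep_start:]
    if len(deep) == 0:
        raise ValueError("deep_start leaves no deep blocks to constrain")
    gaps = [np.mean((np.asarray(f) - anchor) ** 2) for f in deep]
    return float(np.mean(gaps))


# Toy usage: six constant "feature maps" whose value equals the block index.
outputs = [np.full((2, 3), float(i)) for i in range(6)]
loss = irc_loss(outputs, shallow_idx=1, deep_start=4)
```

Minimizing this value with respect to the image perturbation pushes the deep feature maps toward the early, low-semantic one. For a model such as CogVideoX the text above suggests something like `shallow_idx=3` and `deep_start=27`, though whether those layer numbers are zero- or one-based is not stated.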
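Anti-I2V achieves the strongest protection on nearly every metric across all three video diffusion backbones, on both face-centric and full-body human-action benchmarks; the one exception is a DINO score on Open-Sora 1.2, where MIST is marginally lower (0.710 vs. 0.713). Lower values indicate poorer generated-video quality and thus stronger protection.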
| Model | Method | ISM ↓ | C-FIQA ↓ | Q-A (F) ↓ | Q-A (V) ↓ | DINO ↓ |
|---|---|---|---|---|---|---|
| CogVideoX-5B (MM-DiT) | Clean (no defense) | 0.721 | 0.522 | 0.746 | 0.802 | 0.828 |
| | AdvDM | 0.583 | 0.473 | 0.463 | 0.543 | 0.748 |
| | MIST | 0.561 | 0.463 | 0.476 | 0.577 | 0.750 |
| | VGMShield | 0.554 | 0.461 | 0.464 | 0.557 | 0.745 |
| | Anti-I2V (Ours) | 0.448 | 0.433 | 0.447 | 0.532 | 0.722 |
| DynamiCrafter (U-Net) | Clean (no defense) | 0.528 | 0.467 | 0.724 | 0.794 | 0.622 |
| | AdvDM | 0.269 | 0.370 | 0.167 | 0.207 | 0.397 |
| | MIST | 0.262 | 0.379 | 0.232 | 0.269 | 0.386 |
| | VGMShield | 0.286 | 0.431 | 0.243 | 0.289 | 0.401 |
| | Anti-I2V (Ours) | 0.151 | 0.303 | 0.032 | 0.047 | 0.167 |
| Open-Sora 1.2 (DiT) | Clean (no defense) | 0.598 | 0.508 | 0.712 | 0.782 | 0.811 |
| | AdvDM | 0.506 | 0.478 | 0.496 | 0.561 | 0.725 |
| | MIST | 0.493 | 0.475 | 0.497 | 0.594 | 0.710 |
| | VGMShield | 0.500 | 0.476 | 0.497 | 0.578 | 0.716 |
| | Anti-I2V (Ours) | 0.461 | 0.453 | 0.478 | 0.554 | 0.713 |
| CogVideoX-5B (MM-DiT) | Clean (no defense) | 0.466 | 0.373 | 0.361 | 0.436 | 0.801 |
| | AdvDM | 0.370 | 0.292 | 0.271 | 0.342 | 0.753 |
| | MIST | 0.355 | 0.290 | 0.262 | 0.340 | 0.751 |
| | VGMShield | 0.361 | 0.292 | 0.265 | 0.343 | 0.753 |
| | Anti-I2V (Ours) | 0.346 | 0.283 | 0.251 | 0.323 | 0.734 |
| DynamiCrafter (U-Net) | Clean (no defense) | 0.384 | 0.345 | 0.501 | 0.562 | 0.709 |
| | AdvDM | 0.110 | 0.335 | 0.162 | 0.201 | 0.451 |
| | MIST | 0.100 | 0.335 | 0.322 | 0.381 | 0.493 |
| | VGMShield | 0.108 | 0.336 | 0.318 | 0.374 | 0.486 |
| | Anti-I2V (Ours) | 0.068 | 0.268 | 0.057 | 0.084 | 0.164 |
| Open-Sora 1.2 (DiT) | Clean (no defense) | 0.400 | 0.382 | 0.409 | 0.437 | 0.750 |
| | AdvDM | 0.346 | 0.309 | 0.327 | 0.362 | 0.686 |
| | MIST | 0.339 | 0.309 | 0.338 | 0.392 | 0.677 |
| | VGMShield | 0.341 | 0.312 | 0.335 | 0.369 | 0.680 |
| | Anti-I2V (Ours) | 0.318 | 0.248 | 0.311 | 0.347 | 0.642 |
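To put the raw scores in perspective, one can compute how far each defense pushes a metric below the undefended baseline. The snippet below does this for ISM on DynamiCrafter using the first set of rows in the table; the relative-drop measure and the helper name `relative_drop` are introduced here for illustration and are not metrics reported by the authors.

```python
def relative_drop(clean: float, protected: float) -> float:
    """Fraction by which a score falls relative to the undefended baseline."""
    if clean <= 0:
        raise ValueError("baseline score must be positive")
    return (clean - protected) / clean


# ISM on DynamiCrafter (U-Net), first group of rows in the table.
ism_clean, ism_advdm, ism_anti_i2v = 0.528, 0.269, 0.151

print(f"AdvDM:    {relative_drop(ism_clean, ism_advdm):.1%}")
print(f"Anti-I2V: {relative_drop(ism_clean, ism_anti_i2v):.1%}")
```

Under this measure, Anti-I2V lowers ISM by roughly 71% against roughly 49% for AdvDM on that backbone.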
Anti-I2V provides robust protection across diverse image-to-video generation models, degrading output quality while preserving the imperceptibility of the applied perturbations.
Prompt "A static medium shot of a doctor in a formal lab setting. He wears a light-colored collared shirt and keep their hands clasped in front of them. A lamp and globe are visible in the background. The camera remains fixed, with no changes in posture, framing, or environment throughout the video."
Prompt "A woman with short dark hair seated in a dimly lit room. She wears a dark top and sit relaxed in a chair, with one arm resting on the armrest and the other extended along the chair. The background is softly blurred, keeping focus on the person."
Prompt "A video of a woman with short curly hair seated indoors, speaking as if in a conversation. The background shows a neutral room with a bookshelf and a framed picture on the wall. The camera remains fixed, focused from the waist up. The woman mostly sits still with hands on their lap, then raises their hands briefly to gesture while speaking."
Prompt "A close-up of a person with curly hair, wearing a pink headband and a white top, against a plain light-colored background. At first, their eyes are closed and their mouth is slightly open. Then their expression shifts to surprise, with eyes wide open and mouth agape."
Prompt "A middle-aged man seated in a chair, wearing a purple shirt. They face the camera directly with a relaxed posture and hands resting on the armrests, as if in a conversation. The background is dark and simple, keeping attention on the person."
Prompt "A man with long curly hair seated in a studio, wearing a dark top. They sit in front of a microphone as if recording or speaking on a podcast. A plain backdrop and microphone stand are visible, with a plant on the right. The person keeps their hands on their lap and maintains a steady posture. The camera remains fixed with no major scene changes."
Prompt "A static frontal medium shot of a man with brown beard giving a speech behind a fish tank outdoors. They wear a light-colored shirt and a dark hat, gesturing with their left hand to the left while speaking. A clear blue sky and calm body of water are visible in the background."
Prompt "A woman with long dark hair seated against a plain light-colored wall, wearing a gray top. They gesture expressively while speaking, first raising one hand, then both hands with palms upward. Their expression becomes animated, with wide eyes and an open mouth, as if reacting or emphasizing a point. The background stays unchanged and the camera remains fixed."
Prompt "A man sits quietly in a dimly lit room washed in red and purple tones. They wear a black top, and their hair partly hides their eyes, giving the scene a mysterious feeling."
Prompt "A 30-year-old woman sits on a black couch in a fixed medium shot, wearing a dark top layered with a patterned garment. She gestures expressively with both arms, alternately raising and lowering her hands with her palms facing upward or downward."
Prompt "A static medium close-up of a soldier wearing a dark patterned shirt, with the background softly blurred. He begins with their hands clasped in front of them, then slowly open and raise both hands, with the right hand slightly higher than the left, as if emphasizing a point."
Prompt "A formally dressed speaker stands behind a wooden desk, wearing a dark suit, tie, and white shirt. A laptop and microphone on the desk create a polished professional setting. The speaker remains composed with hands clasped in front, as if delivering a serious presentation or pausing for emphasis. The camera stays still, with no changes to the framing, lighting, or background."
@InProceedings{Vu_2026_CVPR,
    author    = {Vu, Duc and Nguyen, Anh and Tran, Chi and Tran, Anh},
    title     = {Anti-I2V: Safeguarding your Photos from Malicious Image-to-video Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {37621-37631}
}