MEDUSA: Motion Elimination in Diffusion Using Spectral Attack

Hongwei Yu*, Daoqing Zha*, Xinlong Ding, Jiawei Li, Junbao Zhuo, Qiankun Liu, Huimin Ma, Jiansheng Chen
University of Science and Technology Beijing
* Equal Contribution · Corresponding Author
[Figure: Overview of the MEDUSA spectral attack pipeline]
MEDUSA formulates motion elimination as a spectral attack on video diffusion models. Starting from a clean reference image, it optimizes an imperceptible adversarial perturbation by minimizing the nuclear norm of the temporal attention matrix, suppressing trailing singular values and inducing temporal rank collapse. The resulting rank-1 attention pattern makes frames attend to nearly the same temporal content, effectively freezing motion while preserving the scene semantics.
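As a rough illustration of the quantity being minimized, the sketch below computes the nuclear norm (the sum of singular values) of a small temporal attention matrix. It is a toy NumPy example under stated assumptions, not the authors' implementation: the helper names `temporal_attention` and `nuclear_norm`, the frame count, and the random query/key features are all hypothetical, and no video diffusion model or image perturbation is involved.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(q, k):
    """Toy frame-to-frame attention: q, k have shape (frames, dim)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d), axis=-1)

def nuclear_norm(a):
    """Sum of the singular values of a 2-D matrix."""
    return np.linalg.svd(a, compute_uv=False).sum()

frames = 8  # assumed number of frames, for illustration only

# Each frame attends only to itself: full rank, all singular values are 1,
# so the nuclear norm equals the number of frames.
identity_like = np.eye(frames)

# Every frame attends to frame 0: all rows identical, rank 1,
# single nonzero singular value sqrt(frames).
frozen = np.tile(np.eye(frames)[0], (frames, 1))

# Attention built from random (hypothetical) query/key features.
rng = np.random.default_rng(0)
q = rng.normal(size=(frames, 16))
k = rng.normal(size=(frames, 16))
attn = temporal_attention(q, k)

print("identity-like:", nuclear_norm(identity_like))
print("frozen (rank 1):", nuclear_norm(frozen))
print("random q/k:", nuclear_norm(attn))
```

In this toy setting the rank-1 "frozen" pattern has a much smaller nuclear norm than the identity-like pattern, which is the direction a nuclear-norm objective pushes the attention matrix.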

Abstract

With the widespread application of Video Diffusion Models (VDMs), video synthesis has achieved remarkable temporal dynamics. Image-to-Video (I2V) generation allows users to provide reference images, which enables attackers to inject adversarial noise into these conditions. Due to the robust spatio-temporal priors in VDMs, conventional frame-level attacks merely induce superficial artifacts and struggle to suppress the synthesis of motion semantics. In this work, we approach the problem by exploring the underlying mechanism of temporal dynamics. We reveal that a static video manifests as temporal rank collapse, a degenerate state characterized by rank-1 degeneracy within the temporal attention matrix. Guided by this insight, we propose Motion Elimination in Diffusion Using Spectral Attack (MEDUSA) to freeze the video. It minimizes the nuclear norm of the attention matrix to induce temporal rank collapse. This objective circumvents the vanishing gradient problem encountered when directly imposing a rigid temporal mapping on the attention matrix. Furthermore, we provide a mathematical analysis of this phenomenon and of the vanishing gradient problem during optimization. Experiments confirm that MEDUSA achieves excellent performance and validate the effectiveness of spectral constraints.

Analysis

[Figure: Temporal attention maps and singular value spectrum analysis for MEDUSA]
Temporal attention reveals why MEDUSA eliminates motion. Clean generation keeps a high-rank, diagonal-dominant attention structure that supports temporal change, while the attacked attention matrix forms vertical stripe patterns: query frames attend to nearly identical key frames. The singular-value spectrum decays sharply after the attack, providing visual evidence of temporal rank collapse.
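The contrast between the two attention structures can be reproduced on synthetic matrices. The sketch below builds a diagonal-dominant matrix and a vertical-stripe matrix by hand and compares their singular-value spectra. Everything here is an assumption made for illustration: the matrices are synthetic rather than extracted from a model, and `effective_rank` is an entropy-based summary chosen for this example, not a metric taken from the paper.

```python
import numpy as np

def singular_values(a):
    return np.linalg.svd(a, compute_uv=False)

def effective_rank(a, eps=1e-12):
    """Exponential of the entropy of the normalized singular values.

    Close to the matrix size for a flat spectrum, close to 1 when a
    single singular value dominates.
    """
    s = singular_values(a)
    p = s / s.sum()
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))

def row_normalize(a):
    # Make each row sum to 1, like a softmax attention row.
    return a / a.sum(axis=1, keepdims=True)

frames = 8  # assumed number of frames
idx = np.arange(frames)

# Diagonal-dominant: attention weight decays with temporal distance.
diag_dominant = row_normalize(np.exp(-2.0 * np.abs(idx[:, None] - idx[None, :])))

# Vertical stripes: every query frame shares one key distribution,
# plus a little noise so the matrix is not exactly rank 1.
rng = np.random.default_rng(0)
stripe_profile = np.array([0.70, 0.20, 0.04, 0.02, 0.01, 0.01, 0.01, 0.01])
striped = row_normalize(
    np.tile(stripe_profile, (frames, 1)) + 1e-3 * rng.random((frames, frames))
)

for name, a in [("diagonal-dominant", diag_dominant), ("striped", striped)]:
    s = singular_values(a)
    print(name, "spectrum:", np.round(s, 4), "effective rank:", round(effective_rank(a), 3))
```

On these synthetic inputs the striped matrix has one dominant singular value and a near-zero tail, while the diagonal-dominant matrix keeps a comparatively flat spectrum.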
[Figure: Optimization comparison between MEDUSA and a hard-target baseline]
The optimization comparison explains the advantage of the spectral objective. A hard target that directly forces a static attention template encounters the vanishing gradient problem described in the paper and fails to optimize reliably. MEDUSA instead minimizes the nuclear norm of temporal attention, giving a smooth and stable descent path toward a lower-rank state.
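One general property that may help explain the stable descent is that the nuclear norm has a simple gradient with respect to the matrix itself: for a full-rank matrix with SVD A = U S V^T, the gradient is U V^T. The sketch below checks this with finite differences and runs a few plain gradient steps directly on the matrix entries. It is a standard matrix-calculus illustration under stated assumptions, not the paper's optimization: the real attack updates an image perturbation through the diffusion model, and this toy neither keeps rows normalized nor reproduces the hard-target baseline or its vanishing gradient. The names `nuclear_norm_grad`, `lr`, and the test matrix are hypothetical.

```python
import numpy as np

def nuclear_norm(a):
    return np.linalg.svd(a, compute_uv=False).sum()

def nuclear_norm_grad(a):
    """Gradient of the nuclear norm, U @ Vt, valid where `a` has full rank."""
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
frames = 6  # assumed matrix size
a = np.eye(frames) + 0.05 * rng.normal(size=(frames, frames))

# Finite-difference check of the directional derivative.
direction = rng.normal(size=a.shape)
h = 1e-6
fd = (nuclear_norm(a + h * direction) - nuclear_norm(a - h * direction)) / (2 * h)
analytic = np.sum(nuclear_norm_grad(a) * direction)
print("finite difference:", fd, "analytic:", analytic)

# Plain gradient descent on the matrix entries. Each step subtracts
# lr from every singular value, so the norm drops by lr * frames per
# step for as long as all singular values stay above zero.
lr = 0.05
x = a.copy()
history = [nuclear_norm(x)]
for _ in range(5):
    x = x - lr * nuclear_norm_grad(x)
    history.append(nuclear_norm(x))
print("nuclear norm per step:", np.round(history, 4))
```

Because every singular value shrinks by the same amount per step in this toy, the smallest ones would reach zero first, which matches the intuition of suppressing trailing singular values.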
[Figure: Qualitative image examples showing clean inputs and adversarial perturbation results]
Qualitative results show that MEDUSA suppresses temporal dynamics more completely than existing image- and video-level baselines. Competing attacks may leave residual foreground or background motion, introduce color shifts, or create structural artifacts, whereas MEDUSA produces perceptually static video sequences while preserving the semantic content of the input scene.

Result

BibTeX

@inproceedings{yu2026medusa,
  title     = {MEDUSA: Motion Elimination in Diffusion Using Spectral Attack},
  author    = {Yu, Hongwei and Zha, Daoqing and Ding, Xinlong and Li, Jiawei and Zhuo, Junbao and Liu, Qiankun and Ma, Huimin and Chen, Jiansheng},
  booktitle = {Proceedings of the International Conference on Machine Learning},
  year      = {2026}
}