EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation

The University of Sydney
Beijing Technology and Business University

*Indicates Equal Contribution

Abstract

Conditional human animation transforms a static reference image into a dynamic video sequence by applying motion cues such as poses. These motion cues are typically derived from video data, which suffers from low temporal resolution, motion blur, overexposure, and inaccuracies under low-light conditions. In contrast, event cameras provide data streams with exceptionally high temporal resolution, a wide dynamic range, and inherent resistance to motion blur and exposure issues. In this work, we propose EvAnimate, a framework that leverages event streams as motion cues to animate static human images. Our approach employs a specialized event representation that transforms asynchronous event streams into 3-channel slices with controllable slicing rates and appropriate slice density, ensuring compatibility with diffusion models. A dual-branch architecture then harnesses the inherent motion dynamics of the event streams to generate videos with high visual quality and temporal consistency. Specialized data augmentation strategies further improve cross-person generalization. Finally, we establish a new benchmark, comprising simulated event data for training and validation and a real-world event dataset capturing human actions under both normal and extreme scenarios. Experimental results demonstrate that EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
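
As a concrete illustration of the slicing step, the sketch below converts an asynchronous event stream into fixed-rate 3-channel slices. The channel layout (positive-polarity counts, negative-polarity counts, normalized timestamps) and the function interface are assumptions chosen for illustration, not the exact representation used in the paper.

# Minimal NumPy sketch: asynchronous events -> fixed-rate 3-channel slices.
# Assumed channel layout per pixel: (positive count, negative count, mean
# normalized timestamp); EvAnimate's actual layout may differ.
import numpy as np

def events_to_slices(events, height, width, slice_rate_hz=30.0):
    """events: (N, 4) array of (x, y, t_seconds, polarity in {-1, +1})."""
    t0 = events[:, 2].min()
    dt = 1.0 / slice_rate_hz
    num_slices = max(1, int(np.ceil((events[:, 2].max() - t0) / dt)))
    slices = np.zeros((num_slices, 3, height, width), dtype=np.float32)

    for x, y, t, p in events:
        s = min(int((t - t0) / dt), num_slices - 1)
        xi, yi = int(x), int(y)
        if p > 0:
            slices[s, 0, yi, xi] += 1.0                   # positive-polarity count
        else:
            slices[s, 1, yi, xi] += 1.0                   # negative-polarity count
        slices[s, 2, yi, xi] += (t - (t0 + s * dt)) / dt  # within-slice timestamp

    # Normalize the timestamp channel by the per-pixel event count.
    counts = slices[:, 0] + slices[:, 1]
    slices[:, 2] = np.where(counts > 0, slices[:, 2] / np.maximum(counts, 1.0), 0.0)
    return slices  # (num_slices, 3, H, W): one 3-channel slice per time window

With the slicing rate matched to the target frame rate, each slice can serve as a per-frame motion condition for the diffusion backbone.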

Overview of EvAnimate


Comparison between conventional image-to-video methods and the proposed EvAnimate framework. EvAnimate leverages event streams as motion cues to generate controllable videos at high temporal resolution. Moreover, it produces superior video quality and is more robust under challenging scenarios such as motion blur, low light, and overexposure.

Structure of EvAnimate


At its core, a spatial-temporal UNet generates latent representations of video frames. Four key components guide the process: (1) Reference Image Alignment preserves the visual characteristics of the input by projecting the reference image into the latent space via a VAE and integrating semantic features from CLIP and face encoders; (2) Event Condition Alignment controls motion by estimating pose from event signals and jointly encoding the pose and event representations with a dual encoder (the EvPose Encoder); (3) Diffusion Loss serves as the primary training objective by matching the latent representations of generated and ground-truth videos; and (4) Motion Gradient Alignment Loss leverages the event conditions to enforce consistent, realistic motion dynamics. A schematic sketch of how these pieces could fit together in one training step follows below.
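
The PyTorch-style sketch below assembles the two objectives described above. All module interfaces (the EvPose encoder, the VAE, the UNet call signature, the scheduler helpers) are hypothetical placeholders, and the motion-gradient term shown here, matching frame-to-frame differences of the reconstructed latents against those of the ground truth, is one plausible reading of "motion gradient alignment", not the authors' exact formulation.

# Schematic training step combining diffusion loss and a motion-gradient term.
# Every interface below is an assumed placeholder, not EvAnimate's actual API.
import torch
import torch.nn.functional as F

def training_step(unet, evpose_encoder, vae, scheduler,
                  ref_image, clip_feats, face_feats,
                  event_slices, pose_maps, gt_video, lambda_motion=0.1):
    # (1) Reference image alignment: project the reference into the VAE latent space.
    ref_latent = vae.encode(ref_image)

    # (2) Event condition alignment: jointly encode event slices and estimated poses.
    motion_cond = evpose_encoder(event_slices, pose_maps)

    # (3) Diffusion loss on noised ground-truth video latents.
    gt_latents = vae.encode(gt_video)                            # (B, T, C, h, w)
    noise = torch.randn_like(gt_latents)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (gt_latents.shape[0],), device=gt_latents.device)
    noisy = scheduler.add_noise(gt_latents, noise, t)
    pred_noise = unet(noisy, t, ref_latent=ref_latent,
                      motion_cond=motion_cond,
                      semantic_feats=(clip_feats, face_feats))
    diffusion_loss = F.mse_loss(pred_noise, noise)

    # (4) Motion gradient alignment (illustrative): recover the predicted clean
    # latents and align their temporal differences with the ground truth's.
    pred_latents = scheduler.predict_clean(noisy, pred_noise, t)  # assumed helper
    motion_loss = F.mse_loss(pred_latents[:, 1:] - pred_latents[:, :-1],
                             gt_latents[:, 1:] - gt_latents[:, :-1])

    return diffusion_loss + lambda_motion * motion_loss

The weighting between the two terms (lambda_motion above) is an illustrative hyperparameter; the paper does not specify its value here.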

Qualitative Results


BibTeX
