DreaMoving: A Human Video Generation Framework
based on Diffusion Models
Technical Report
Mengyang Feng, Jinlin Liu, Kai Yu, Yuan Yao, Zheng Hui, Xiefan Guo, Xianhui Lin, Haolan Xue, Chen Shi, Xiaowen Li, Aojie Li, Xiaoyang Kang, Biwen Lei, Miaomiao Cui, Peiran Ren, Xuansong Xie
Institute for Intelligent Computing, Alibaba Group
TL;DR: DreaMoving is a diffusion-based controllable video generation framework to produce high-quality customized human videos.
Teaser video prompts:
A girl, smiling, standing on a beach next to the ocean, wearing a light yellow dress with long sleeves.
An Asian girl, smiling, dancing in Central Park, wearing a long shirt and long jeans.
A girl, smiling, in the park with golden leaves in autumn, wearing a coat with long sleeves.
A man, dancing in front of the Pyramids of Egypt, wearing a suit with a blue tie.
A girl, smiling, dancing in a French town, wearing a long light blue dress.
A woman, smiling, in Times Square, wearing white clothes and long pants.
Abstract
In this paper, we present DreaMoving, a diffusion-based controllable video generation framework for producing high-quality customized human videos. Specifically, given a target identity and a posture sequence, DreaMoving can generate a video of the target identity dancing anywhere, driven by the posture sequence. To this end, we propose a Video ControlNet for motion control and a Content Guider for identity preservation. The proposed model is easy to use and can be adapted to most stylized diffusion models to generate diverse results.
Architecture
Overview of DreaMoving. The Video ControlNet is an image ControlNet with motion blocks injected after each U-Net block; it processes the control sequence (pose or depth) into additional temporal residuals. The Denoising U-Net is a Stable Diffusion U-Net extended with motion blocks for video generation. The Content Guider transfers the input text prompt and appearance cues, such as a human face image (a clothing image is optional), into content embeddings for cross-attention.
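The per-block wiring described above (a spatial U-Net block followed by a temporal motion block) can be sketched as follows. This is a minimal illustration only: it assumes the motion block is temporal self-attention over the frame axis with a residual connection, and the class and parameter names are hypothetical, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class MotionBlock(nn.Module):
    """Hypothetical motion block: temporal self-attention across frames.

    Frames of one video are stacked along the batch axis, as is common in
    video diffusion U-Nets; each spatial location attends over time.
    """

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor, num_frames: int) -> torch.Tensor:
        # x: (batch * num_frames, channels, H, W)
        bf, c, h, w = x.shape
        b = bf // num_frames
        # Rearrange to (b * H * W, num_frames, channels) so attention
        # mixes information along the frame (time) axis only.
        t = (
            x.view(b, num_frames, c, h * w)
            .permute(0, 3, 1, 2)
            .reshape(b * h * w, num_frames, c)
        )
        out, _ = self.attn(self.norm(t), self.norm(t), self.norm(t))
        out = (
            out.reshape(b, h * w, num_frames, c)
            .permute(0, 2, 3, 1)
            .reshape(bf, c, h, w)
        )
        # Residual connection; in practice such blocks are often initialized
        # so this path starts near zero and the image model is undisturbed.
        return x + out


# Toy forward pass: one 8-frame video of 32-channel 16x16 feature maps.
frames = 8
x = torch.randn(frames, 32, 16, 16)
block = MotionBlock(channels=32)
y = block(x, num_frames=frames)
print(y.shape)  # torch.Size([8, 32, 16, 16])
```

In the full model, one such block would sit after every spatial U-Net block in both the Denoising U-Net and the Video ControlNet, with the ControlNet's outputs added to the U-Net features as temporal residuals.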
Results
DreaMoving generates high-quality, high-fidelity videos given a guidance sequence and a simple content description, e.g., text and a reference image, as input. Specifically, DreaMoving offers identity control through a face reference image, precise motion manipulation via a pose sequence, and overall video appearance control via a text prompt.
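The three control channels above can be thought of as one input bundle per generated clip. The sketch below is purely illustrative: the class, field names, and file paths are hypothetical and do not correspond to any released DreaMoving API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DreaMovingInputs:
    """Hypothetical bundle of the three control signals (illustrative only)."""

    prompt: str                            # text: overall video appearance
    pose_sequence: List[str]               # per-frame pose maps: motion control
    face_reference: Optional[str] = None   # face image: identity control
    cloth_reference: Optional[str] = None  # garment image: optional


# Example bundle for a 16-frame clip (paths are placeholders).
inputs = DreaMovingInputs(
    prompt="A girl, smiling, dancing in a French town, wearing a long light blue dress.",
    pose_sequence=[f"pose_{i:04d}.png" for i in range(16)],
    face_reference="face.png",
)
print(len(inputs.pose_sequence))  # 16
```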
Inputs: reference image and pose sequence. Prompts:
A girl, smiling, dancing in a French town, wearing a suit and white shirt.
A girl, smiling, dancing in front of a desk with green plants, wearing a short shirt and long jeans.
A girl, smiling, dancing on a beach next to the ocean, wearing a short white dress with sleeves.
Inputs: reference image and pose sequence. Prompts:
A girl, smiling, dancing in a wooden house, wearing a sweater and long pants.
A girl, smiling, dancing in the park with golden leaves in autumn, wearing a light blue dress.
A girl, smiling, dancing in Times Square, wearing a dress-like white shirt with long sleeves and long pants.
Unseen domain results
DreaMoving exhibits robust generalization to unseen domains.
Citation
Acknowledgements
The website template was borrowed from Michaël Gharbi and Mip-NeRF.