Dance Any Beat: Blending Beats with Visuals in Dance Video Generation

University of Sydney

Abstract

We explore the intersection of two distinct research areas: conditional image-to-video generation and music-to-dance generation. We introduce a novel approach that generates dance videos directly from a single image and accompanying music. This research not only investigates the role of music as a conditional input in image-to-video synthesis but also sets a foundational benchmark for direct dance video generation in response to musical cues. We further examine the extraction of information crucial for motion-music beat alignment, leveraging both a large-scale foundation model and an established signal-processing tool. Through extensive experimentation, we establish a robust baseline for this novel task.
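For illustration, the sketch below shows one plausible way to obtain the two kinds of musical information mentioned above: a semantic embedding from a pretrained CLAP model and explicit beat timings from a signal-processing beat tracker. The specific checkpoint (laion/clap-htsat-unfused), the use of HuggingFace's ClapModel, and librosa's beat tracker are illustrative assumptions, not a description of the paper's exact pipeline.

```python
# Hedged sketch: extract a CLAP embedding and beat times for one music clip.
# Assumptions (not from the paper): HuggingFace CLAP checkpoint
# "laion/clap-htsat-unfused" and librosa's default beat tracker.
import librosa
import torch
from transformers import ClapModel, ClapProcessor

def extract_music_features(audio_path: str):
    # CLAP was trained on 48 kHz audio, so resample the clip accordingly.
    waveform, sr = librosa.load(audio_path, sr=48000, mono=True)

    # Semantic music embedding from a pretrained CLAP audio encoder.
    processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")
    model = ClapModel.from_pretrained("laion/clap-htsat-unfused").eval()
    inputs = processor(audios=waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        clap_embedding = model.get_audio_features(**inputs)  # (1, embed_dim)

    # Beat information from a classical signal-processing beat tracker.
    tempo, beat_frames = librosa.beat.beat_track(y=waveform, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)  # seconds

    return clap_embedding, tempo, beat_times
```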

Method


The aim of our method is to generate latent optical flows guided by musical input. Recent work in motion transfer has demonstrated the efficacy of latent optical flow for warping one image into another. In addition, generation is more resource-efficient in a low-dimensional latent flow space, which requires less computation and time than working in high-dimensional pixel or latent feature spaces. For the generative model, we choose diffusion models because of their exceptional sample quality and robust controllability. The first stage of our method trains an auto-encoder to estimate the optical flow between two frames of a video sequence. This trained auto-encoder then supports training of the diffusion model, which generates the latent flows; a sketch of how such a flow can warp the starting frame is given below. The other fundamental component of our approach is the extraction of musical information: our baseline model employs CLAP to encode the music, while our enhanced model additionally incorporates beat information for a richer representation.
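To make the role of the latent flow concrete, the following sketch shows how a predicted dense optical flow could warp the starting frame into a future frame using PyTorch's grid_sample. The function name, tensor shapes, and the assumption that the latent flow has already been decoded to pixel-space offsets are illustrative; the paper's actual decoder and warping module may differ.

```python
# Hedged sketch: warp a starting frame with a (decoded) dense optical flow.
# Assumption: the latent flow has already been decoded to a per-pixel flow
# field of shape (N, 2, H, W), expressed in pixel offsets.
import torch
import torch.nn.functional as F

def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """frame: (N, 3, H, W) start image; flow: (N, 2, H, W) pixel offsets."""
    n, _, h, w = frame.shape

    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0).unsqueeze(0)      # (1, 2, H, W)

    # Shift the grid by the predicted flow, then normalize to [-1, 1]
    # as required by grid_sample.
    coords = base + flow                                   # (N, 2, H, W)
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)       # (N, H, W, 2)

    return F.grid_sample(frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

Under this view, a diffusion model conditioned on the music embedding would predict a sequence of latent flows, and each decoded flow would warp the start frame into the corresponding video frame.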

Dance any beat with any person

DabFusion is capable of animating individuals in unseen scenarios, which requires a few preprocessing steps. First, we employ YOLO [1] as an object detector and the Segment Anything Model [2] as a segmentation model to detect and segment the person from the unseen image. We then substitute the individual in an image from the AIST++ [3] dataset with the newly segmented person, producing an image that combines the unfamiliar person with a background from the dataset.
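The following sketch illustrates one possible realization of this preprocessing step. It uses the ultralytics YOLO package and the official segment_anything package as stand-ins for the detector and segmentation model; the specific checkpoints, the choice of the first detected person, and the compositing logic are assumptions for illustration rather than the exact pipeline.

```python
# Hedged sketch: detect a person, segment them with SAM, and paste them onto
# an AIST++ background. Checkpoints and selection heuristics are assumptions.
import numpy as np
import cv2
from ultralytics import YOLO
from segment_anything import SamPredictor, sam_model_registry

def composite_person(unseen_img_path: str, aist_bg_path: str) -> np.ndarray:
    image = cv2.imread(unseen_img_path)        # person in an unseen scene
    background = cv2.imread(aist_bg_path)      # frame from AIST++

    # 1) Detect the person with a YOLO detector (COCO class 0 = person).
    detector = YOLO("yolov8n.pt")
    det = detector(image, classes=[0])[0]
    box = det.boxes.xyxy[0].cpu().numpy()      # take the first person box

    # 2) Segment the detected person with the Segment Anything Model.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    masks, _, _ = predictor.predict(box=box, multimask_output=False)
    mask = masks[0].astype(bool)               # (H, W) person mask

    # 3) Paste the segmented person onto the AIST++ background.
    background = cv2.resize(background, (image.shape[1], image.shape[0]))
    composite = background.copy()
    composite[mask] = image[mask]
    return composite
```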

Dance video generation with any person

We select five individuals from various settings, each adopting a unique pose, and generate dance videos from different pieces of music. The generated videos have a resolution of 128x128.


Generate dance videos with the same music piece

Break Dance

House Dance

Pop Dance

Ballet Jazz Dance

Street Jazz Dance

Lock Dance

Generate dance videos with random music pieces

Generate videos on the AIST++ dataset

DabFusion is trained on the AIST++ training set; we generate videos in different scenarios from the test set to demonstrate its versatility.

Same music, different angles, different poses.

Same starting frame, different music pieces


References

  1. Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

  2. Kirillov, Alexander, et al. "Segment anything." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

  3. Li, Ruilong, et al. "AI Choreographer: Music conditioned 3D dance generation with AIST++." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.