X-Dancer Generates Realistic Dance Videos from Still Images and Music

From Still Images to Dance Videos: AI-Generated Choreographies Thanks to X-Dancer
The generation of realistic and expressive human movement through Artificial Intelligence (AI) is a research field with great potential. Applications range from animation in films and video games to virtual trainers for sports and dance. A new approach in this area is X-Dancer, a method for generating dance videos from music and a single still image. X-Dancer makes it possible to create long, diverse, and realistic dance sequences without 3D models or complex motion capture.
How X-Dancer Works
X-Dancer is based on a Transformer-Diffusion framework. At its core is an autoregressive Transformer model that synthesizes long, music-synchronized token sequences for 2D poses of the body, head, and hands. These pose sequences then serve as conditioning for a diffusion model, which generates coherent, realistic video frames for the dance video.
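The two-stage flow can be sketched in a toy form: an autoregressive sampler stands in for the music-to-motion Transformer, and a placeholder renderer stands in for the diffusion decoder. All names, shapes, and the conditioning logic here are illustrative assumptions, not X-Dancer's actual implementation.

```python
import numpy as np

def generate_pose_tokens(music_feats, vocab=512, ctx_len=8, seed=0):
    """Toy autoregressive sampler standing in for the music-to-motion
    Transformer: each step conditions on the current music feature and
    the recent pose-token context (hypothetical mechanics)."""
    rng = np.random.default_rng(seed)
    tokens = []
    for feat in music_feats:
        logits = rng.standard_normal(vocab)
        logits += 0.5 * feat              # music conditioning (toy)
        for tok in tokens[-ctx_len:]:     # preceding-motion context (toy)
            logits[tok] += 0.1
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab, p=probs)))
    return tokens

def render_frames(ref_image, pose_tokens):
    """Placeholder for the diffusion decoder: one frame per pose token."""
    return [(ref_image, tok) for tok in pose_tokens]

music = np.sin(np.linspace(0.0, 6.28, 32))   # fake per-frame music features
tokens = generate_pose_tokens(music)
frames = render_frames("ref.png", tokens)
```

The point of the sketch is the interface, not the models: stage one turns music into a discrete pose-token sequence, stage two turns tokens plus a reference image into frames.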
In contrast to traditional methods, which primarily model human movement in 3D, X-Dancer focuses on the 2D plane. This approach bypasses the challenges of data acquisition for 3D models and increases the scalability of the system. By using readily available monocular videos, X-Dancer can capture a wide range of 2D dance movements and learn their subtle coordination with the musical rhythm.
The 2D poses are represented as spatially composed tokens formed from 2D keypoints and their confidence values. These tokens encode both coarse body movements (e.g., upper and lower body) and finer movements (e.g., head and hands). The music-to-motion Transformer model then autoregressively generates the dance pose token sequences, taking into account both the musical style and the preceding movement context via global attention.
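A minimal sketch of what such a tokenization could look like: keypoints and confidences for each body part are flattened and snapped to the nearest entry of a part-specific codebook. The codebooks, part list, and keypoint counts below are illustrative assumptions; X-Dancer learns its tokenizer rather than using random codebooks.

```python
import numpy as np

def quantize_keypoints(keypoints, conf, codebook):
    """Map (x, y, confidence) keypoints of one body part to the index of
    the nearest codebook entry -- a toy stand-in for a learned tokenizer."""
    feat = np.concatenate([keypoints.ravel(), conf])   # flatten part pose
    dists = np.linalg.norm(codebook - feat, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
# hypothetical part decomposition and keypoint counts
parts = {"upper_body": 8, "lower_body": 8, "head": 5, "hands": 21}
tokens = {}
for name, n_kp in parts.items():
    codebook = rng.standard_normal((256, n_kp * 3))   # toy 256-entry codebook
    kp = rng.random((n_kp, 2))                        # normalized 2D keypoints
    conf = rng.random(n_kp)                           # detector confidences
    tokens[name] = quantize_keypoints(kp, conf, codebook)
```

Composing one token per part per frame is what lets the Transformer model coarse and fine motion jointly while keeping the sequence discrete.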
Finally, X-Dancer uses a diffusion model to animate the reference image with the synthesized pose tokens. The token information is injected via Adaptive Instance Normalization (AdaIN), yielding a fully differentiable end-to-end architecture.
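AdaIN itself is simple to state: normalize each channel of a feature map to zero mean and unit variance, then re-scale and re-shift it with statistics predicted from the conditioning signal. The sketch below applies this to a random feature map, with the per-channel scale and shift derived from a pose-token embedding through toy random projections (the projections and shapes are assumptions for illustration).

```python
import numpy as np

def adain(content, style_scale, style_shift, eps=1e-5):
    """Adaptive Instance Normalization: per-channel spatial normalization,
    then re-scaling/shifting with condition-derived statistics."""
    mean = content.mean(axis=(1, 2), keepdims=True)
    std = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mean) / (std + eps)
    return style_scale[:, None, None] * normalized + style_shift[:, None, None]

rng = np.random.default_rng(1)
feat = rng.standard_normal((16, 8, 8))    # C x H x W diffusion feature map (toy)
token_emb = rng.standard_normal(32)       # embedding of one pose token (toy)
# toy linear projections from the embedding to per-channel scale and shift
scale = np.tanh(rng.standard_normal((16, 32)) @ token_emb) + 1.0
shift = 0.1 * (rng.standard_normal((16, 32)) @ token_emb)
out = adain(feat, scale, shift)
```

Because every operation here is differentiable, gradients can flow from the rendered frames back through the conditioning path, which is what makes end-to-end training possible.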
Potential and Outlook
Initial results show that X-Dancer can generate diverse and characteristic dance videos and surpasses existing methods in terms of diversity, expressiveness, and realism. The technology could revolutionize the creation of creative content and open up new possibilities in areas such as entertainment, virtual training, and digital art.
The developers of X-Dancer plan to make the code and model available for research purposes. This will enable further research and development in this promising area and could lead to further innovations in AI-powered motion generation.
Bibliography:
Chen, Z., Xu, H., Song, G., Xie, Y., Zhang, C., Chen, X., Wang, C., Chang, D., & Luo, L. X-Dancer: Expressive Music to Human Dance Video Generation. arXiv preprint arXiv:2502.17414 (2025).
Tseng, Y. L., Chang, C. Y., Chen, H. T., & Hsu, W. H. EDGE: Editable Dance Generation From Music. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 21605-21614) (2023).
Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., & Sebe, N. First order motion model for image animation. In Advances in Neural Information Processing Systems (pp. 7137-7147) (2020).
Lee, H. Y., Lee, J. Y., & Kim, J. Dancing to music. In Advances in Neural Information Processing Systems (pp. 11234-11244) (2019).
Alekseev, A., Lebedev, V., Potapov, A., & Dylov, D. V. Styleflow: Attribute-conditioned exploration of stylegan-generated images using conditional continuous normalizing flows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 12809-12818) (2021).
Martinez-Gonzalez, P., Zhang, J., Sun, Y., & Argyriou, V. DeepDance: Music-to-Dance Motion Choreography with Adversarial Learning. Sensors, 24(2), 588 (2024).
Lu, C., Chen, W., & Liu, Z. Make-a-Video: Text-to-Video Generation without Text-Video Data. arXiv preprint arXiv:2209.14792 (2022).
Shi, W., Zhu, Y., Zhao, S., Zhao, Y., & Loy, C. C. MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model. arXiv preprint arXiv:2405.03178 (2024).
Sun, S., Zhang, Z., Wang, Z., Bao, H., & Zhou, J. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. In Proceedings of the 27th ACM International Conference on Multimedia (pp. 324-332) (2019).
Van der Maaten, L., & Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11) (2008).