CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation
Abstract
The fidelity and structural diversity of training datasets largely determine the generative quality and capability boundaries of video synthesis models. With the recent advent of foundational audio-visual architectures, conventional single-modality datasets are also increasingly insufficient for current research needs. While proprietary commercial systems have recently demonstrated strong capabilities in generating cinematic narratives, the progress of open-source models remains severely hindered by the scarcity of high-quality training data.
To address this bottleneck, we introduce CineDance-1M, a large-scale open-source Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot long-form generation. On average, each sequence lasts 92.8 seconds and contains 24.2 continuous shots, an unprecedented scale, and comes with natively paired high-quality audio and video together with configurable structured annotations. The quality of CineDance-1M stems from a rigorous three-stage curation pipeline: (i) diverse sourcing and comprehensive cleansing, (ii) film-theory-inspired narrative parsing for coherent shot grouping, and (iii) hierarchical dual-modal captioning using anchor tokens.
To assess this new paradigm systematically, we also propose CineBench, a difficulty-stratified evaluation suite with a five-dimensional, human-aligned metric system. In addition, we extend LTX-2.3 to obtain CineDance, which maintains audio-video alignment and identity consistency across complex multi-shot sequences, validating our curation strategy. We anticipate that this work will serve as a foundation for future research in multi-shot, long-form audio-video generation.