CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

Yuheng Chen¹, Teng Hu¹, Yuji Wang¹, Qingdong He², Zhucun Xue³, Qianyu Zhou⁴, Jason Li⁵, Lizhuang Ma¹, Jiangning Zhang³, Dacheng Tao⁵

¹Shanghai Jiao Tong University ²University of Electronic Science and Technology of China ³Zhejiang University ⁴The University of Tokyo ⁵Nanyang Technological University

📢 The dataset exceeds 80 TB. To save time before training, we did not hard-crop the raw videos; instead, we applied slice-based soft cropping guided by metadata. We are now hard-cropping the data, a CPU-bound task that we expect to take substantial time. Meanwhile, we are discussing how to release the dataset and plan to adopt a gated access mechanism. If you have any questions, feel free to reach out at fengjianliuli627@gmail.com 😊. Thanks for your attention 🙏!
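The difference between soft and hard cropping can be sketched as follows. This is only an illustration under stated assumptions, not the authors' tooling: the `Slice` record, its field names, and both helper functions are hypothetical. The only external facts relied on are the standard ffmpeg options `-ss`, `-to`, and `-i`.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slice:
    """Hypothetical metadata record describing one clip inside a raw video."""
    source: str     # path to the raw, uncut video
    start_s: float  # clip start, in seconds from the beginning of the source
    end_s: float    # clip end, in seconds from the beginning of the source


def soft_crop(s: Slice) -> dict:
    """Soft cropping: leave the raw file untouched and record where to seek at load time."""
    return {"path": s.source, "seek_s": s.start_s, "duration_s": s.end_s - s.start_s}


def hard_crop_command(s: Slice, out_path: str) -> List[str]:
    """Hard cropping: build an ffmpeg command that writes the clip to its own file.

    No codec is forced, so ffmpeg re-encodes with its defaults; that
    re-encoding is what makes this step CPU-heavy.
    """
    if s.end_s <= s.start_s:
        raise ValueError("slice must have positive duration")
    return [
        "ffmpeg",
        "-ss", f"{s.start_s:.3f}",
        "-to", f"{s.end_s:.3f}",
        "-i", s.source,
        out_path,
    ]
```

Soft cropping costs nothing up front but keeps every loader dependent on the raw files and metadata. Hard cropping pays the processing cost once and yields standalone clips.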

Abstract

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data.

To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This quality is achieved through a rigorous three-stage curation pipeline: (i) diverse sourcing and comprehensive cleansing, (ii) film-theory-inspired narrative parsing, and (iii) hierarchical dual-modal captioning.

For comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates strong single-modality quality, precise audio-video alignment, and robust subject and environment consistency.

CineDance-1M

CineDance-1M targets the missing training unit for modern cinematic generation: not isolated short clips, but long audio-video sequences with consistent shot structure, aligned sound, and reusable annotations. The curation pipeline consists of three stages: data preparation and quality assessment, bottom-up multi-shot narrative parsing, and configurable structured dual-modal annotation.
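To make "configurable, structured dual-modal annotation" concrete, here is a minimal sketch of what a per-shot record and a configurable prompt renderer might look like. The schema is an assumption for illustration: the class names, field names, and prompt layout are hypothetical and do not describe the released annotation format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Shot:
    """Hypothetical per-shot annotation with one caption per modality."""
    start_s: float
    end_s: float
    video_caption: str
    audio_caption: str = ""


@dataclass
class Sample:
    """Hypothetical multi-shot sample: an ordered list of annotated shots."""
    video_id: str
    shots: List[Shot] = field(default_factory=list)

    def duration_s(self) -> float:
        """Span from the first shot's start to the last shot's end."""
        if not self.shots:
            return 0.0
        return self.shots[-1].end_s - self.shots[0].start_s

    def render_prompt(self, include_audio: bool = True, include_timing: bool = False) -> str:
        """Render the annotations as text, switching parts on or off."""
        lines = []
        for i, shot in enumerate(self.shots, start=1):
            head = f"Shot {i}"
            if include_timing:
                head += f" [{shot.start_s:.1f}s-{shot.end_s:.1f}s]"
            text = f"{head}: {shot.video_caption}"
            if include_audio and shot.audio_caption:
                text += f" | Audio: {shot.audio_caption}"
            lines.append(text)
        return "\n".join(lines)
```

The same record can then yield a video-only prompt, an audio-video prompt, or a timed shot list, depending on the flags passed to `render_prompt`.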

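Dataset Comparison

| Dataset | Resolution | Avg. duration | Avg. shots | Shot captions | Audio | Audio annotation | Caption length | Total duration | Clips | Year |
|---|---|---|---|---|---|---|---|---|---|---|
| HowTo100M | 240p | 3.6s | | None | | | | 134.5Khr | 136M | 2019 |
| HD-VILA-100M | 720p | 13.4s | | None | | | 32.5 | 371.5Khr | 103M | 2022 |
| Koala-36M | 720p | 13.6s | | None | | | 202.3 | 137Khr | 36M | 2024 |
| VIDGEN-1M | 720p | 10.6s | | None | Partial | None | 89.3 | 2.9Khr | | 2024 |
| MiraData | 720p | 72.1s | 7.15 | None | | | 319 | 6.6Khr | 330K | 2024 |
| LVD-2M | 720p | 20.2s | 1.86 | None | | | 88.8 | 14.6Khr | 2.1M | 2024 |
| OpenHumanVid | 720p | 4.6s | | None | All | None | 99.7 | 12Khr | 16M | 2025 |
| OpenS2V-5M | 720p | 5.6s | | None | Partial | None | 312.06 | 5.8Khr | 3.75M | 2025 |
| UltraVideo | 4K/8K | 5.3s | 1.17 | None | | | 824.3 | 62hr | 42K | 2025 |
| OpenVid-1M | 720p | 7.2s | | None | Partial | None | 126.5 | 2.1Khr | | 2025 |
| CineTrans | 720p | 10.7s | 2.53 | None | | | 250.78 | 752hr | 252K | 2025 |
| SpeakerVid-5M | 1080p | 8.3s | 1.27 | None | All | ASR | 20.69 | 11.6Khr | 5.07M | 2025 |
| CineDance-1M | 1080p | 92.8s | 24.2 | | All | Structured | 6496.3 | 26.3Khr | | 2026 |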
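As a quick sanity check on the comparison figures, total duration divided by average duration gives the implied number of clips. The short sketch below uses only numbers quoted in the comparison; the helper name `implied_clips` is introduced here for illustration.

```python
def implied_clips(total_hours: float, avg_duration_s: float) -> float:
    """Number of clips implied by a total duration (hours) and a mean clip length (seconds)."""
    return total_hours * 3600.0 / avg_duration_s


# CineDance-1M: 26.3K hours at 92.8 s per video -> about 1.02 million videos.
cinedance_clips = implied_clips(26.3e3, 92.8)

# MiraData: 6.6K hours at 72.1 s per clip -> about 330 thousand clips,
# which matches the clip count reported for it.
miradata_clips = implied_clips(6.6e3, 72.1)

# CineDance-1M: 92.8 s spread over 24.2 shots -> about 3.8 s per shot.
avg_shot_length_s = 92.8 / 24.2

print(f"CineDance-1M implied videos: {cinedance_clips:,.0f}")
print(f"MiraData implied clips:      {miradata_clips:,.0f}")
print(f"CineDance-1M mean shot length: {avg_shot_length_s:.2f} s")
```

The implied count of roughly one million videos is consistent with the dataset's name, and the mean shot length of a few seconds shows how densely the long videos are cut.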

Dataset Quality And Diversity

Statistical overview of the CineDance-1M dataset

CineDance-1M is analyzed across taxonomy, quality, duration, annotation density, narrative structure, and semantic vocabulary. These statistics support filtering, benchmark construction, and controllable long-form generation research.

CineBench

CineBench evaluates whether a model can synthesize temporally ordered multi-shot sequences from structured conditions. It contains 1,000 test cases stratified by theme/style, duration and shot count, and generation difficulty. The benchmark covers three duration tiers: 10 s with 2-3 shots, 30 s with 4-9 shots, and a forward-looking 60 s tier with 10-20 shots.
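The duration and shot-count tiers can be written down as a small lookup. This is a sketch, not the benchmark's code: the `Tier` type and the `match_tier` helper are hypothetical, and only the three (duration, shot range) pairs come from the description above.

```python
from typing import NamedTuple, Optional


class Tier(NamedTuple):
    duration_s: int  # target clip length in seconds
    min_shots: int   # smallest shot count in this tier
    max_shots: int   # largest shot count in this tier


# The three tiers described for CineBench.
TIERS = (
    Tier(10, 2, 3),
    Tier(30, 4, 9),
    Tier(60, 10, 20),
)


def match_tier(duration_s: int, shot_count: int) -> Optional[Tier]:
    """Return the tier a test case belongs to, or None if it fits no tier."""
    for tier in TIERS:
        if tier.duration_s == duration_s and tier.min_shots <= shot_count <= tier.max_shots:
            return tier
    return None
```

For example, `match_tier(30, 5)` returns the 30-second tier, while `match_tier(10, 5)` returns `None` because five shots exceed the 10-second tier's range.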

CineBench. The benchmark combines diverse taxonomic flow, quality distributions, and semantic vocabulary for long-form multi-shot audio-video evaluation.

Video Quality: fidelity, imaging quality, motion smoothness

Audio Quality: speech clarity, background sound, acoustic artifacts

AV Sync: lip synchronization and semantic audio-video alignment

Prompt Alignment: characters, scenes, events, dialogue, sound descriptions

Narrative Continuity: identity, scene, object persistence, ordered events

Shot Structure Response: shot count, transitions, and temporal shot layout
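A per-sample score card over these six dimensions might be organized as below. This is a minimal sketch under loud assumptions: the page does not state the score scale or how dimensions are combined, so the `[0, 1]` range and the unweighted mean used here are illustrative choices only, and the identifier names are hypothetical.

```python
from statistics import mean
from typing import Dict

# The six CineBench dimensions, as snake_case keys (naming is an assumption).
DIMENSIONS = (
    "video_quality",
    "audio_quality",
    "av_sync",
    "prompt_alignment",
    "narrative_continuity",
    "shot_structure_response",
)


def summarize(scores: Dict[str, float]) -> float:
    """Check that all six dimensions are present and in [0, 1], then average them.

    The unweighted mean is an assumption for illustration, not the
    benchmark's documented aggregation rule.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0.0 <= scores[d] <= 1.0:
            raise ValueError(f"{d} is outside [0, 1]: {scores[d]}")
    return mean(scores[d] for d in DIMENSIONS)
```

Reporting the six values separately alongside any aggregate keeps failure modes visible, for instance strong video quality paired with weak shot structure response.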

CineDance Model

We adapt LTX-2.3 into CineDance as a robust open baseline for multi-shot long-form audio-video generation. The model uses the native joint audio-video backbone while learning shot-transition capability, subject and environment consistency, and structured prompt response from CineDance-1M.

The training strategy uses visual-temporal reference scaffolds as optimization aids, then progressively removes them so the model can retain multi-shot organization under reduced inference conditions.
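One way to picture "progressively removing" the reference scaffolds is a schedule for the probability that a training step receives them. The sketch below is an assumption for illustration: the page does not specify the schedule, so the hold-then-linear-decay shape, the `hold_frac` parameter, and both function names are hypothetical.

```python
import random


def scaffold_keep_prob(step: int, total_steps: int, hold_frac: float = 0.25) -> float:
    """Probability of supplying reference scaffolds at a given training step.

    Assumed shape: always supply them for the first `hold_frac` of training,
    then decay linearly to zero by the final step.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= hold_frac < 1.0:
        raise ValueError("hold_frac must be in [0, 1)")
    progress = min(max(step / total_steps, 0.0), 1.0)
    if progress <= hold_frac:
        return 1.0
    return max(0.0, 1.0 - (progress - hold_frac) / (1.0 - hold_frac))


def use_scaffold(step: int, total_steps: int, rng: random.Random) -> bool:
    """Sample whether this step is conditioned on the scaffolds."""
    return rng.random() < scaffold_keep_prob(step, total_steps)
```

Under this assumed schedule the keep probability reaches zero at the end of training, so the final steps match inference, where no scaffolds are provided.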

BibTeX

@misc{chen2026cinedancenextgenerationmultishotlongform,
  title={CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation},
  author={Yuheng Chen and Teng Hu and Yuji Wang and Qingdong He and Zhucun Xue and Qianyu Zhou and Jason Li and Lizhuang Ma and Jiangning Zhang and Dacheng Tao},
  year={2026},
  eprint={2606.09639},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.09639},
}
