This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at https://github.com/yaolinli/TimeChat-Captioner.