Accurate dialogue description in audiovisual video captioning is crucial for downstream understanding and generation tasks. However, existing models generally struggle to produce faithful dialogue descriptions within audiovisual captions. To mitigate this limitation, we propose DiaDem, an audiovisual video captioning model that generates captions with more precise dialogue descriptions while maintaining strong overall performance. We first synthesize a high-quality dataset for supervised fine-tuning (SFT), then apply a difficulty-partitioned two-stage GRPO strategy to further improve dialogue descriptions. To enable systematic evaluation of dialogue description capabilities, we introduce DiaDemBench, a comprehensive benchmark that evaluates models across diverse dialogue scenarios, emphasizing both speaker attribution accuracy and utterance transcription fidelity in audiovisual captions. Extensive experiments on DiaDemBench reveal that even commercial models still have substantial room for improvement in dialogue-aware captioning. Notably, DiaDem not only outperforms the Gemini series in dialogue description accuracy but also achieves competitive performance on general audiovisual captioning benchmarks, demonstrating its overall effectiveness.