Traditional speaker diarization systems have primarily focused on constrained scenarios such as meetings and interviews, where the number of speakers is limited and acoustic conditions are relatively clean. To explore open-world speaker diarization, we extend this task to the visual media domain, encompassing complex audiovisual programs such as films and TV series. This new setting introduces several challenges, including long-form video understanding, a large number of speakers, cross-modal asynchrony between audio and visual cues, and uncontrolled in-the-wild variability. To address these challenges, we propose Cinematic Speaker Registration & Diarization (CineSRD), a unified multimodal framework that leverages visual, acoustic, and linguistic cues from video, speech, and subtitles for speaker annotation. CineSRD first performs visual anchor clustering to register initial speakers and then integrates an audio language model for speaker turn detection, refining annotations and supplementing unregistered off-screen speakers. Furthermore, we construct and release a dedicated speaker diarization benchmark for visual media that includes Chinese and English programs. Experimental results demonstrate that CineSRD achieves superior performance on the proposed benchmark and competitive results on conventional datasets, validating its robustness and generalizability in open-world visual media settings.