Audio chaptering, the task of automatically segmenting long-form audio into coherent sections, is increasingly important for navigating podcasts, lectures, and videos. Despite its relevance, research remains limited and largely text-based, leaving key questions unresolved about how to leverage audio information, handle ASR errors, and evaluate without transcripts. We address these gaps through three contributions: (1) a systematic comparison of text-based models augmented with acoustic features, a novel audio-only architecture (AudioSeg) that operates on learned audio representations, and multimodal LLMs (MLLMs); (2) an empirical analysis of factors affecting performance, including transcript quality, acoustic features, duration, and speaker composition; and (3) formalized evaluation protocols contrasting transcript-dependent text-space evaluation with transcript-invariant time-space evaluation. Our experiments on YTSeg show that AudioSeg substantially outperforms text-based approaches, that pauses provide the largest acoustic gains, and that MLLMs remain limited by context length and weak instruction following, yet show promise on shorter audio.