Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.
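The constrained-decoding formulation can be sketched in miniature: at each sentence boundary the decoder is allowed exactly two actions, copy the next source sentence verbatim or emit a break marker first, so the output is guaranteed to preserve the transcript. The `break_score` function and its threshold below are illustrative stand-ins, not the paper's actual model.

```python
# Minimal sketch of constrained decoding for paragraph segmentation.
# All names here are hypothetical; the paper's model and scoring differ.

BREAK = "<p>"

def constrained_segment(sentences, break_score):
    """Insert paragraph breaks between sentences without altering them.

    break_score(prev, nxt) stands in for an LLM's probability of a
    paragraph break at the boundary between two adjacent sentences.
    """
    out = [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if break_score(prev, nxt) > 0.5:  # illustrative threshold
            out.append(BREAK)             # only extra symbol allowed
        out.append(nxt)                   # source sentence copied verbatim
    return out

# Toy heuristic standing in for a language model: break when the
# next sentence opens with a discourse cue.
def toy_score(prev, nxt):
    return 1.0 if nxt.split()[0] in {"Now", "So", "Next"} else 0.0

sents = ["We study transcripts.", "They lack structure.",
         "Now consider segmentation.", "It restores paragraphs."]
segmented = constrained_segment(sents, toy_score)

# Stripping the markers recovers the original transcript exactly,
# which is what enables faithful, sentence-aligned evaluation.
assert [s for s in segmented if s != BREAK] == sents
```

Because the constraint set at each step contains only the next source token sequence or the break marker, the original wording can never drift, unlike free-form LLM rewriting.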