Current disfluency detection models focus on individual utterances, each from a single speaker. However, many discontinuity phenomena in spoken conversational transcripts occur across multiple turns, which hampers human readability and the performance of downstream NLP tasks. This study addresses these phenomena by proposing an innovative Multi-Turn Cleanup task for spoken conversational transcripts and collecting a new dataset, MultiTurnCleanup. We design a data labeling schema to collect the high-quality dataset and provide extensive data analysis. Furthermore, we evaluate two modeling approaches to serve as benchmarks for future research.