The detection of disfluencies such as hesitations, repetitions and false starts, which are common in speech, is a widely studied area of research. Because evaluation on the Switchboard Corpus follows a standardised process, model performance can be easily compared across approaches. This is not the case for disfluency detection on learner speech, however, where datasets have restricted access policies, which makes comparing approaches and developing improved models more challenging. To address this issue, this paper describes the adaptation of the NICT-JLE corpus, containing approximately 300 hours of English learners' oral proficiency tests, to a format suitable for training and evaluating disfluency detection models. Points of difference between the NICT-JLE and Switchboard corpora are explored, followed by a detailed overview of adaptations to the tag set and meta-features of the NICT-JLE corpus. This work provides standardised train, heldout and test sets for use in future research on disfluency detection for learner speech.