Overlapped Speech Detection (OSD) is an important component of speech applications that analyze multi-party conversations. However, most existing OSD models are trained and evaluated on a single dataset, which limits the scenarios in which they can be applied. To address this problem, we study large-scale learning (LSL) for OSD and propose a more general 16K single-channel OSD model. For this study, we collect 522 hours of labeled audio in different languages and styles and use it as the large-scale dataset. We design rigorous comparative experiments to evaluate the effectiveness of LSL in the OSD task and the performance of OSD models based on different deep neural networks. The results show that LSL significantly improves the performance and robustness of OSD models, and that the Conformer-based OSD model (CF-OSD) trained with LSL is currently the best 16K single-channel OSD model. Moreover, CF-OSD with LSL establishes state-of-the-art performance, with F1-scores of 80.8% and 52.0% on the Alimeeting test set and the DIHARD II evaluation set, respectively.
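For readers unfamiliar with the metric, the following is a minimal sketch of how an F1-score for OSD could be computed. It assumes frame-level binary labels (1 = overlapped speech, 0 = otherwise) and treats overlap as the positive class; the abstract does not specify the exact scoring protocol, so the function name `frame_f1` and the toy label sequences are hypothetical illustrations, not the authors' evaluation code.

```python
def frame_f1(reference, hypothesis):
    """Frame-level precision, recall and F1 for the positive (overlap) class.

    Both arguments are equal-length sequences of 0/1 labels, one per frame.
    This frame-level formulation is an assumption made for illustration.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    pairs = list(zip(reference, hypothesis))
    tp = sum(1 for r, h in pairs if r == 1 and h == 1)  # overlap correctly detected
    fp = sum(1 for r, h in pairs if r == 0 and h == 1)  # false alarm
    fn = sum(1 for r, h in pairs if r == 1 and h == 0)  # missed overlap

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example with 8 frames (made-up labels, not data from the paper).
reference = [0, 1, 1, 1, 0, 0, 1, 0]
hypothesis = [0, 1, 1, 0, 0, 1, 1, 0]
precision, recall, f1 = frame_f1(reference, hypothesis)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because overlapped frames are typically a small fraction of a recording, F1 on the overlap class is more informative than plain accuracy, which a detector could inflate by never predicting overlap.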