End-to-end diarization presents an attractive alternative to standard cascaded diarization systems because a single system can handle all aspects of the task at once. Many flavors of end-to-end models have been proposed, but all of them require large amounts of annotated training data, which so far do not exist. A compromise is to generate synthetic data, and the recently proposed simulated conversations (SC) have shown remarkable improvements over the original simulated mixtures (SM). In this work, we create SC with multiple speakers per conversation and show that they allow for substantially better performance than SM while also reducing the dependence on a fine-tuning stage. We also create SC from wide-band public audio sources and present an analysis on several evaluation sets. Together with this publication, we release the recipes for generating such data, models trained on public sets, and an implementation that efficiently handles multiple speakers per conversation and supports an auxiliary voice activity detection loss.