We introduce a multi-speaker speech data simulator that generates multi-speaker recordings with a controllable distribution of silence and overlap. This distribution is set by adjusting statistical parameters, which makes it possible to tailor training data for neural speaker diarization and voice activity detection models. Large datasets for speaker diarization are difficult to acquire, particularly for multi-speaker scenarios, and both speaker diarization and voice activity detection depend on precise timestamp annotation of the training speech. The proposed simulator addresses these problems by generating large-scale audio mixtures. We demonstrate that the statistical properties of the generated mixtures closely align with the input parameters, which are derived from real-world statistics. We also show the effectiveness of speaker diarization and voice activity detection models trained exclusively on the simulated datasets.
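To make the idea of parameter-controlled silence and overlap concrete, the following is a minimal sketch and not the simulator described above. It places speaker segments on a timeline only (no audio is mixed), and every name in it (`simulate_timeline`, `silence_and_overlap_ratio`, `overlap_prob`, `mean_silence`, `mean_overlap`, `mean_speech`) is hypothetical. The exponential distributions for gap, overlap, and segment lengths are an assumption chosen for simplicity.

```python
import random


def simulate_timeline(num_segments, num_speakers, overlap_prob,
                      mean_silence, mean_overlap, mean_speech=3.0, seed=0):
    """Return (start, end, speaker) tuples in seconds.

    Hypothetical sketch: each new segment either overlaps the previous one
    (with probability overlap_prob) or follows it after a silent gap.
    Requires num_speakers >= 2 so consecutive segments can differ in speaker.
    """
    rng = random.Random(seed)
    segments = []
    prev_end = 0.0       # end time of the most recent segment
    floor = 0.0          # earliest allowed start; keeps at most two speakers active
    prev_speaker = None
    for _ in range(num_segments):
        # Consecutive segments always come from different speakers.
        speaker = rng.choice([s for s in range(num_speakers) if s != prev_speaker])
        duration = 0.5 + rng.expovariate(1.0 / mean_speech)
        if segments and rng.random() < overlap_prob:
            # Overlap case: start before the previous segment ends.
            overlap = rng.expovariate(1.0 / mean_overlap)
            overlap = min(overlap, prev_end - floor, 0.9 * duration)
            start = prev_end - overlap
        else:
            # Silence case: leave a gap after the previous segment.
            start = prev_end + rng.expovariate(1.0 / mean_silence)
        end = start + duration
        segments.append((start, end, speaker))
        floor = max(start, prev_end)
        prev_end = end
        prev_speaker = speaker
    return segments


def silence_and_overlap_ratio(segments):
    """Measure the fraction of session time with zero and with 2+ active segments."""
    events = []
    for start, end, _ in segments:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal times, segment ends (-1) sort before starts (+1)
    total = max(end for _, end, _ in segments)
    silence = overlap = 0.0
    active, last_t = 0, 0.0
    for t, delta in events:
        if active == 0:
            silence += t - last_t
        elif active >= 2:
            overlap += t - last_t
        active += delta
        last_t = t
    return silence / total, overlap / total


timeline = simulate_timeline(num_segments=200, num_speakers=3, overlap_prob=0.3,
                             mean_silence=0.8, mean_overlap=0.5)
sil, ovl = silence_and_overlap_ratio(timeline)
print(f"silence ratio: {sil:.3f}, overlap ratio: {ovl:.3f}")
```

Comparing the measured ratios against the values implied by the input parameters is the same kind of check the abstract describes, here applied only to a toy timeline.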