This paper introduces the Extended Length Audio Dataset for Synthetic Voice Detection and Speaker Recognition (ELAD SVDSR), a resource specifically designed to facilitate the creation of high-quality deepfakes and to support the development of detection systems trained against them. The dataset comprises 45-minute audio recordings from 36 participants, each reading various newspaper articles under controlled conditions, captured via five microphones of differing quality. By focusing on extended-duration audio, ELAD SVDSR captures a richer range of speech attributes, such as pitch contours, intonation patterns, and nuanced delivery, enabling models to generate more realistic and coherent synthetic voices. In turn, this approach allows for the creation of robust deepfakes that can serve as challenging examples in datasets used to train and evaluate synthetic voice detection methods. As part of this effort, 20 deepfake voices have already been created and added to the dataset to showcase its potential. Anonymized speaker demographic metadata accompanies the dataset. ELAD SVDSR is expected to spur significant advancements in audio forensics, biometric security, and voice authentication systems.