Maliciously created fake speech, including deepfaked and spoofed audio, is proliferating at an alarming rate, and detection models are racing to keep pace. Yet most detection models are trained to perform inference on frame-level audio features alone, without leveraging valuable linguistic cues at larger timescales. To address this gap, we present Linguistically Augmented Audio Speech Data (LinguAS), a dataset of genuine and deepfaked audio samples annotated with five strategically chosen Expert-Defined Linguistic Features (EDLFs) that occur frequently in spoken English and are characteristic of natural human speech. LinguAS contains over 800 audio samples, each of which is annotated with EDLFs. The dataset is balanced across four spoofed-audio attack types and includes a proportionate number of genuine speech samples. We also include metadata on speaker gender and on the generator or source of each spoofed audio sample, offering more granularity for model training. We found that models trained on EDLF-augmented data significantly outperformed the ASVspoof 2021 deep learning baselines and self-supervised learning (SSL) models such as HuBERT and XLSR. LinguAS's linguistic, gender, and generator metadata provide audio deepfake researchers with a dataset that emphasizes real human language traits to improve model inference on faked speech. Data and code are publicly available.