Replay speech attacks pose a significant threat to voice-controlled systems, especially in smart environments where voice assistants are widely deployed. While multi-channel audio offers spatial cues that can make replay detection more robust, existing datasets and methods predominantly rely on single-channel recordings. Moreover, previous studies have shown that generalizing replay attack detection to new environments is challenging, which calls for new methods of generating data that cover a variety of acoustic conditions. Hence, in this work we introduce an acoustic simulation framework that simulates multi-channel replay speech configurations using publicly available resources. Using the framework, we train the state-of-the-art multi-channel replay detector M-ALRAD on simulated data only and evaluate its generalization on the ReMASC corpus of real recordings, without any real training data. To better exploit spatial information, we extend M-ALRAD with inter-channel phase difference features computed for adjacent microphone pairs, augmenting the beamformed representation with directional cues. The synthetic datasets will be made available upon acceptance of the paper.
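As a rough illustration of the inter-channel phase difference (IPD) features mentioned above, the following is a minimal sketch in Python with NumPy. It is not the M-ALRAD implementation: the function names `stft` and `adjacent_ipd`, the STFT settings (`n_fft=512`, `hop=256`, Hann window), the `circular` option for ring-shaped arrays, and the cosine/sine encoding of the phase difference are all assumptions made for this example.

```python
import numpy as np


def stft(x, n_fft=512, hop=256):
    """Short-time Fourier transform of a multi-channel signal.

    x: real array of shape (channels, samples).
    Returns a complex array of shape (channels, frames, n_fft // 2 + 1).
    Frame length, hop size and window are illustrative assumptions.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (x.shape[-1] - n_fft) // hop
    # Slice overlapping frames for all channels at once.
    frames = np.stack(
        [x[:, i * hop : i * hop + n_fft] for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames * window, axis=-1)


def adjacent_ipd(x, n_fft=512, hop=256, circular=False):
    """Phase differences between adjacent microphone pairs (m, m + 1).

    Returns (cos_ipd, sin_ipd), each of shape (pairs, frames, bins).
    With circular=True the last microphone is also paired with the
    first one, which may suit ring-shaped arrays (an assumption here).
    """
    phase = np.angle(stft(x, n_fft, hop))
    # Phase of channel m minus phase of channel m + 1 (wrapping at the end).
    diff = phase - np.roll(phase, -1, axis=0)
    if not circular:
        diff = diff[:-1]  # drop the (last, first) wrap-around pair
    # cos/sin encoding avoids the 2*pi discontinuity of the raw angle.
    return np.cos(diff), np.sin(diff)
```

For a four-microphone recording `x` of shape `(4, samples)`, `adjacent_ipd(x)` yields three pairs of feature maps that could be stacked alongside a beamformed spectrogram as extra input channels. How the features are actually combined with the beamformed representation in the extended M-ALRAD is not specified in the abstract.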