Advances in deep learning have enabled the widespread deployment of speaker recognition systems (SRSs), yet they remain vulnerable to score-based impersonation attacks. Existing attacks that operate directly on raw waveforms require a large number of queries due to the difficulty of optimizing in high-dimensional audio spaces. Latent-space optimization within generative models offers improved efficiency, but these latent spaces are shaped by data distribution matching and do not inherently capture speaker-discriminative geometry. As a result, optimization trajectories often fail to align with the adversarial direction needed to maximize victim scores. To address this limitation, we propose an inversion-based generative attack framework that explicitly aligns the latent space of the synthesis model with the discriminative feature space of SRSs. We first analyze the requirements of an inverse model for score-based attacks and introduce a feature-aligned inversion strategy that geometrically synchronizes latent representations with speaker embeddings. This alignment ensures that latent updates directly translate into score improvements. Moreover, it enables new attack paradigms, including subspace-projection-based attacks, which were previously infeasible due to the absence of a faithful feature-to-audio mapping. Experiments show that our method significantly improves query efficiency, achieving competitive attack success rates with, on average, 10x fewer queries than prior approaches. In particular, the subspace-projection-based attack that it enables attains a success rate of up to 91.65% using only 50 queries. These findings establish feature-aligned inversion as a key tool for evaluating the robustness of modern SRSs against score-based impersonation threats.
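To make the score-based setting concrete, the following is a minimal toy sketch, not the proposed method. It assumes a linear stand-in for the speaker encoder (`W_enc`), uses its pseudo-inverse (`W_dec`) as an idealized "feature-aligned" decoder, and runs a plain greedy random search in latent space under a fixed query budget. All names, dimensions, and the search rule are hypothetical illustrations; real SRSs and synthesis models are nonlinear neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
AUDIO_DIM, FEAT_DIM = 256, 16  # toy sizes, chosen only for illustration

# Hypothetical linear "speaker encoder" standing in for the victim SRS front end.
W_enc = rng.standard_normal((FEAT_DIM, AUDIO_DIM)) / np.sqrt(AUDIO_DIM)
# Idealized feature-aligned decoder: the pseudo-inverse maps a latent vector
# to "audio" whose embedding equals that latent vector (up to numerical error).
W_dec = np.linalg.pinv(W_enc)

# Embedding of the enrolled target speaker (unknown to the attacker).
target_embedding = rng.standard_normal(FEAT_DIM)


def decode(z):
    """Map a latent vector to a toy audio vector."""
    return W_dec @ z


def victim_score(audio):
    """Black-box oracle: cosine similarity between the audio's embedding and the target."""
    emb = W_enc @ audio
    denom = np.linalg.norm(emb) * np.linalg.norm(target_embedding) + 1e-12
    return float(emb @ target_embedding / denom)


def latent_attack(z0, n_queries=50, sigma=0.3):
    """Greedy random search in latent space using only returned scores.

    Each candidate costs one query; a candidate is kept only if it raises the score.
    """
    z = z0.copy()
    best = victim_score(decode(z))
    used = 1
    while used < n_queries:
        cand = z + sigma * rng.standard_normal(z.shape)
        score = victim_score(decode(cand))
        used += 1
        if score > best:
            z, best = cand, score
    return z, best, used


z0 = rng.standard_normal(FEAT_DIM)
start = victim_score(decode(z0))
z_adv, final, used = latent_attack(z0)
print(f"score before: {start:.3f}, after: {final:.3f}, queries used: {used}")
```

Because the toy decoder inverts the toy encoder, every latent step is a step of the same size and direction in embedding space, which is the intuition behind the claim that latent updates translate directly into score changes. With a decoder whose latent space is not aligned with the encoder, the same search would move through embedding space along distorted directions.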