In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding extractor for Speaker Verification (SV) that improves upon existing work by replacing temporal pooling with Asymmetric Cross Attention (ACA). ACA distills large, variable-length sequences into small, fixed-size latents by attending a small query to large key and value matrices. In ACA-Net, we use ACA to build a Multi-Layer Aggregation (MLA) block that generates fixed-size identity vectors from variable-length inputs. Through global attention, ACA-Net acts as an efficient global feature extractor that adapts to temporal variability. Existing SV models, by contrast, pool over the temporal dimension with a fixed function, which may obscure information about the signal's non-stationary temporal variability. Our experiments on the WSJ0-1talker dataset show that ACA-Net outperforms a strong baseline, achieving a 5\% relative improvement in EER while using only 1/5 of the parameters.
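The core mechanism, attending a small query to large key and value matrices, can be illustrated with a minimal sketch. This is not the ACA-Net implementation: the function and variable names are hypothetical, the input frames serve directly as both keys and values, and the learned projections, multi-head structure, and MLA block are all omitted. It only shows that the output size is set by the number of latent queries M, not by the number of input frames T, and that the attention cost grows as M·T rather than T².

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def asymmetric_cross_attention(latents, frames):
    """Attend a small latent query set to a long frame sequence.

    latents: list of M vectors (the small query), each of length d
    frames:  list of T vectors, used as both keys and values (assumption:
             no learned Q/K/V projections in this sketch)
    Returns M vectors of length d, regardless of T.
    """
    d = len(latents[0])
    scale = 1.0 / math.sqrt(d)  # standard scaled dot-product factor
    output = []
    for q in latents:
        # One score per input frame: this is the M x T part of the cost.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in frames]
        weights = softmax(scores)
        # Weighted sum of the frames: an input-dependent form of pooling.
        pooled = [sum(w * v[j] for w, v in zip(weights, frames)) for j in range(d)]
        output.append(pooled)
    return output


def random_matrix(rows, cols, rng):
    """Hypothetical helper: Gaussian random features standing in for real frames."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
latents = random_matrix(4, 8, rng)     # M = 4 latent queries, d = 8
short_utt = random_matrix(50, 8, rng)  # T = 50 frames
long_utt = random_matrix(300, 8, rng)  # T = 300 frames

out_short = asymmetric_cross_attention(latents, short_utt)
out_long = asymmetric_cross_attention(latents, long_utt)
print(len(out_short), len(out_short[0]))
print(len(out_long), len(out_long[0]))
```

Both utterances yield a 4 x 8 output even though their lengths differ. Unlike mean pooling, where every frame receives the same weight, the softmax weights here depend on the input frames themselves, which is the sense in which attention-based aggregation can adapt to temporal variability.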