The performance of speaker verification degrades significantly in adverse acoustic environments with strong reverberation and noise. To address this issue, this paper proposes a spatial-temporal graph convolutional network (GCN) method for multi-channel speaker verification with ad-hoc microphone arrays. The method consists of a feature aggregation block and a channel selection block, both of which are built on graphs. The feature aggregation block fuses speaker features across time steps and channels with a spatial-temporal GCN. The graph-based channel selection block discards noisy channels that may contribute negatively to the system. The proposed method is flexible in incorporating various kinds of graphs and prior knowledge. We compared the proposed method with six representative methods in both real-world and simulated environments. Experimental results show that, relative to the strongest reference method, the proposed method reduces the equal error rate (EER) by $\mathbf{15.39\%}$ on the simulated datasets and by $\mathbf{17.70\%}$ on the real datasets. Moreover, its performance is robust across different signal-to-noise ratios and reverberation times.
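To make the two blocks concrete, the following is a minimal NumPy sketch of graph-based feature aggregation followed by channel selection. It is not the paper's implementation: the abstract does not specify how the graph is built or how channels are scored, so the graph layout (all channels connected within a frame, each channel connected to itself in the next frame), the symmetric adjacency normalization, and the norm-based score are all assumptions, and the function names `build_st_adjacency`, `normalize_adjacency`, `gcn_layer`, and `select_channels` are hypothetical.

```python
import numpy as np


def build_st_adjacency(num_channels, num_frames):
    """Adjacency over C*T nodes, with node index = t * C + c (assumed layout)."""
    n = num_channels * num_frames
    adj = np.zeros((n, n))
    for t in range(num_frames):
        for c in range(num_channels):
            i = t * num_channels + c
            # Spatial edges: connect all channels within the same frame.
            for c2 in range(num_channels):
                if c2 != c:
                    adj[i, t * num_channels + c2] = 1.0
            # Temporal edge: connect the same channel in adjacent frames.
            if t + 1 < num_frames:
                j = (t + 1) * num_channels + c
                adj[i, j] = adj[j, i] = 1.0
    return adj


def normalize_adjacency(adj):
    """Add self-loops, then apply D^{-1/2} (A + I) D^{-1/2}."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(a_norm, x, w):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    return np.maximum(a_norm @ x @ w, 0.0)


def select_channels(channel_emb, scores, k):
    """Keep the k highest-scoring channels (hypothetical selection rule)."""
    keep = np.sort(np.argsort(scores)[-k:])
    return keep, channel_emb[keep]


rng = np.random.default_rng(0)
C, T, D, H = 4, 3, 8, 5                # channels, frames, input dim, hidden dim
x = rng.standard_normal((T * C, D))    # one feature vector per (frame, channel) node
w = rng.standard_normal((D, H))        # random stand-in for learned weights

a_norm = normalize_adjacency(build_st_adjacency(C, T))
h = gcn_layer(a_norm, x, w)                    # shape (T * C, H)
channel_emb = h.reshape(T, C, H).mean(axis=0)  # average over time, shape (C, H)

# Assumption: embedding norm stands in for a learned channel-quality score.
scores = np.linalg.norm(channel_emb, axis=1)
keep, selected = select_channels(channel_emb, scores, k=2)
utterance_emb = selected.mean(axis=0)          # fused embedding, shape (H,)
```

In a trained system the weights and the channel scores would be learned jointly with the speaker verification objective. The fixed top-k rule here only illustrates the idea of dropping channels before pooling.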