State-of-the-art deep learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures usually comprise a feature-extractor front-end together with a pooling layer that encodes variable-length utterances into fixed-length speaker vectors. The authors have recently proposed a Double Multi-Head Self-Attention pooling for speaker recognition, placed between a CNN-based front-end and a set of fully connected layers. This pooling has proven to be an effective mechanism for efficiently selecting the most relevant features that the front-end captures from the speech signal. In this paper we show that the architecture also achieves excellent experimental results when adapted to other speaker characterization tasks, such as emotion recognition, sex classification and COVID-19 detection.
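To make the pooling idea concrete, below is a minimal numpy sketch of a double multi-head self-attention pooling step. This is an illustrative simplification, not the authors' exact implementation: the parameter shapes, the per-head dot-product scoring against learnable vectors `U`, the `1/sqrt(d_h)` scaling, and the second attention vector `v` over the head context vectors are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def double_mhsa_pool(H, U, v):
    """Pool variable-length frame features H (T, d) into a fixed-length vector.

    Step 1 (multi-head attention over frames): the feature dimension is split
    into n_heads slices; each head scores its slice of every frame against a
    learnable vector (a row of U) and takes a softmax-weighted average over
    the T frames, yielding one context vector per head.
    Step 2 (second attention over heads): a single attention, parameterized by
    v, weights the head context vectors and mixes them into one utterance-level
    vector whose size does not depend on T.
    """
    T, d = H.shape
    n_heads, d_h = U.shape
    assert d == n_heads * d_h
    heads = H.reshape(T, n_heads, d_h)               # (T, n_heads, d_h)
    logits = np.einsum('thd,hd->th', heads, U) / np.sqrt(d_h)
    w = softmax(logits, axis=0)                      # per-head weights over frames
    ctx = np.einsum('th,thd->hd', w, heads)          # (n_heads, d_h) head contexts
    hw = softmax(ctx @ v / np.sqrt(d_h))             # (n_heads,) weights over heads
    return hw @ ctx                                  # fixed-length (d_h,) vector

# Demo: two utterances of different lengths map to same-sized embeddings.
d, n_heads = 32, 4
U = rng.normal(size=(n_heads, d // n_heads))
v = rng.normal(size=(d // n_heads,))
emb_long = double_mhsa_pool(rng.normal(size=(120, d)), U, v)
emb_short = double_mhsa_pool(rng.normal(size=(37, d)), U, v)
```

The key property illustrated is that the attention weights are computed over the time axis, so utterances of any duration are reduced to an embedding of a fixed size determined only by the model parameters.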