One approach to language identification (LID) derives speech representations from models pre-trained with self-supervised learning and then fine-tunes the model for the LID task. State-of-the-art approaches to LID use an attention-based statistical pooling layer to aggregate contextual information across the time frames of the embedding vectors extracted from the pre-trained model. In this paper, we explore recently proposed attention mechanisms, namely Performer attention and agent attention, in conjunction with the statistical pooling layer. The LID experiments are performed on three datasets: VoxPopuli, FLEURS, and VoxLingua. We compare their performance against vanilla self-attention. Our findings suggest that Performer attention outperforms self-attention, and that agent attention exhibits comparable or occasionally superior performance to self-attention while being computationally less expensive.
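To make the pooling step concrete, the following is a minimal sketch of an attention-based statistical pooling layer of the kind described above: a small scorer assigns a weight to each time frame of the pre-trained model's embeddings, and the weighted mean and standard deviation are concatenated into a fixed-length utterance vector for the language classifier. The class name, hidden size, and embedding dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class AttentiveStatisticsPooling(nn.Module):
    """Pools frame-level embeddings (batch, frames, dim) into a fixed
    utterance-level vector via attention-weighted mean and std."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        # Small scorer producing one attention logit per time frame.
        self.scorer = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim); normalize attention over the time axis.
        w = torch.softmax(self.scorer(x), dim=1)              # (B, T, 1)
        mean = torch.sum(w * x, dim=1)                        # (B, D)
        var = torch.sum(w * (x - mean.unsqueeze(1)) ** 2, dim=1)
        std = torch.sqrt(var.clamp(min=1e-8))                 # (B, D)
        return torch.cat([mean, std], dim=-1)                 # (B, 2D)


# Hypothetical usage: pool frame embeddings from a pre-trained encoder
# (e.g., 768-dimensional outputs) before a language-ID classifier head.
pool = AttentiveStatisticsPooling(dim=768)
frames = torch.randn(4, 200, 768)        # dummy frame-level embeddings
utterance_vec = pool(frames)             # shape: (4, 1536)
```

In the setting studied in the paper, the scorer above would be replaced or augmented by the attention mechanism under comparison (self-attention, Performer attention, or agent attention) operating across the time frames before the statistics are computed.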