Spatial information is a critical cue for multi-channel multi-speaker target speech recognition. Most state-of-the-art multi-channel Automatic Speech Recognition (ASR) systems extract spatial features only during the speech separation stage and then run standard single-channel ASR on the separated speech. This approach yields an inefficient, lengthy pipeline and sub-optimal ASR performance due to errors accumulated in the preprocessing modules. Furthermore, most spatial feature extraction methods depend on knowledge of speaker positions and microphone topology, tying the system to a specific setup and making it hard to adapt to new equipment. In this work, we address these issues with a lightweight embedding module named SpatialEmb, which extracts and encodes spatial information directly for the ASR model and supports both fixed and arbitrary microphone topologies. We conduct comprehensive experiments on AliMeeting, a real meeting corpus, to determine the optimal model design for SpatialEmb in terms of both performance and efficiency. Our best model, trained on the 105-hour Train-Ali-far set, achieves character error rates (CERs) of 17.04% and 20.32% on the Eval and Test sets, establishing a new state-of-the-art result under the same training data.
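The abstract does not specify how SpatialEmb encodes spatial information, but a common topology-agnostic spatial feature that such a module could build on is the inter-channel phase difference (IPD) of the multi-channel STFT. The sketch below is purely illustrative and does not reflect the paper's actual architecture; the function name `spatial_embedding` and the cos/sin stacking are assumptions.

```python
import numpy as np

def spatial_embedding(multichannel_stft, ref_channel=0):
    """Illustrative spatial feature extraction (NOT the paper's SpatialEmb).

    multichannel_stft: complex array of shape (channels, frames, freq_bins),
        e.g. the STFT of each microphone signal.
    Returns a real-valued feature matrix of shape
        (frames, (channels - 1) * 2 * freq_bins),
    stacking cos/sin of inter-channel phase differences (IPDs) relative to
    a reference channel. Because IPDs need no knowledge of speaker position
    or array geometry, they work for arbitrary microphone topologies.
    """
    phase = np.angle(multichannel_stft)          # (C, T, F) phase spectra
    ipd = phase - phase[ref_channel]             # phase difference vs. reference
    ipd = np.delete(ipd, ref_channel, axis=0)    # drop the all-zero reference row
    # cos/sin encoding avoids phase-wrapping discontinuities at +/- pi
    feats = np.concatenate([np.cos(ipd), np.sin(ipd)], axis=-1)  # (C-1, T, 2F)
    c1, t, f2 = feats.shape
    return feats.transpose(1, 0, 2).reshape(t, c1 * f2)
```

In a SpatialEmb-like design, such per-frame features would then be projected by a small learned layer into the ASR encoder's embedding space, so spatial cues reach the recognizer directly rather than only through a separation front-end.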