Target Speaker Automatic Speech Recognition (TS-ASR) aims to transcribe the speech of a specified target speaker from multi-speaker mixtures in cocktail-party scenarios. Recent advances in Large Audio-Language Models (LALMs) have brought new insights to TS-ASR, but significant room for optimization remains for the task within the LALM architecture. Chain-of-Thought (CoT) reasoning and Reinforcement Learning (RL) have proven effective in certain speech tasks, and TS-ASR, which requires the model to deeply comprehend speech signals, differentiate speakers, and handle overlapping utterances, is particularly well suited to a reasoning-guided approach. We therefore propose a novel framework that incorporates CoT and RL training into TS-ASR to improve performance. We construct a novel CoT dataset for TS-ASR; the TS-ASR model is first trained on regular data and then fine-tuned on the CoT data. Finally, the model is further trained with RL on selected data to enhance its generalized reasoning capabilities. Experimental results show a significant improvement in TS-ASR performance with CoT and RL training, demonstrating the effectiveness of the proposed training methods for the TS-ASR task.
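The abstract does not state which reward signal drives the RL stage. As a minimal sketch only, and purely as an assumption rather than the authors' method, one plausible choice for a transcription task is a reward derived from word error rate (WER) between the reference transcript of the target speaker and the model's hypothesis. The function names `word_edit_distance`, `wer`, and `wer_reward` below are hypothetical and introduced for illustration.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two lists of words."""
    # prev[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate of a hypothesis against a reference transcript."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        # Convention chosen for this sketch: an empty reference is only
        # matched perfectly by an empty hypothesis.
        return 0.0 if not hyp_words else 1.0
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


def wer_reward(reference, hypothesis):
    """Hypothetical RL reward: 1 - WER, clipped to the range [0, 1]."""
    return max(0.0, 1.0 - wer(reference, hypothesis))
```

Clipping at zero keeps the reward bounded when a hypothesis is much longer than the reference, since WER can exceed 1 in that case. Whether the proposed framework uses a reward of this form is not specified in the abstract.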