Target-speaker speech processing (TS) tasks, such as target-speaker automatic speech recognition (TS-ASR), target speech extraction (TSE), and personal voice activity detection (p-VAD), are important for extracting information about a desired speaker's speech even when it is corrupted by interfering speakers. While most studies have focused on training schemes or system architectures for each specific task, the auxiliary network that embeds target-speaker cues has not been investigated comprehensively in a unified cross-task evaluation. This paper therefore addresses a fundamental question: what is the preferred speaker embedding for TS tasks? To this end, for the TS-ASR, TSE, and p-VAD tasks, we compare two kinds of embeddings: those computed by pre-trained speaker encoders (i.e., self-supervised or speaker recognition models) from pre-recorded enrollment speech of the target speaker, and ideal speaker embeddings derived directly from the target speaker's identity in the form of a one-hot vector. To further understand the properties of the ideal speaker embedding, we optimize it using a gradient-based approach to improve performance on the TS task. Our analysis reveals that speaker verification performance has limited correlation with TS task performance, that the one-hot embedding outperforms enrollment-based embeddings, and that the optimal embedding depends on the input mixture.
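The two ideas named above, an embedding derived from a one-hot identity vector and gradient-based optimization of that embedding for a given mixture, can be illustrated with a minimal sketch. This is not the paper's system: the frozen "TS model" here is a toy sigmoid mask, and every name, dimension, and hyperparameter (`table`, `A`, `ts_loss_and_grad`, `optimize_embedding`, the learning rate) is a hypothetical choice made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
num_speakers, emb_dim, feat_dim = 4, 8, 16

# Hypothetical embedding table: multiplying a one-hot identity vector by it
# is equivalent to looking up one row per known speaker.
table = rng.normal(size=(num_speakers, emb_dim))

def one_hot_embedding(speaker_id):
    one_hot = np.zeros(num_speakers)
    one_hot[speaker_id] = 1.0
    return one_hot @ table

# Toy frozen TS model (an assumption, not the paper's network):
# estimate = sigmoid(A @ emb) * mixture
A = rng.normal(size=(feat_dim, emb_dim))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ts_loss_and_grad(mixture, target, emb):
    """Mean squared error of the toy model and its gradient w.r.t. emb."""
    mask = sigmoid(A @ emb)
    err = mask * mixture - target
    loss = np.mean(err ** 2)
    # Chain rule: loss -> estimate -> mask -> pre-activation -> emb
    grad = A.T @ (2.0 * err / feat_dim * mixture * mask * (1.0 - mask))
    return loss, grad

def optimize_embedding(mixture, target, emb_init, lr=0.05, steps=500):
    """Plain gradient descent on the embedding; model weights stay frozen."""
    emb = emb_init.copy()
    for _ in range(steps):
        _, grad = ts_loss_and_grad(mixture, target, emb)
        emb -= lr * grad
    return emb

# Synthetic mixture and target signal for one example.
mixture = rng.normal(size=feat_dim)
target = sigmoid(A @ rng.normal(size=emb_dim)) * mixture

emb0 = one_hot_embedding(2)
init_loss, _ = ts_loss_and_grad(mixture, target, emb0)
emb_opt = optimize_embedding(mixture, target, emb0)
final_loss, _ = ts_loss_and_grad(mixture, target, emb_opt)
```

Because the embedding is optimized against one specific mixture, repeating the procedure with a different `mixture` generally yields a different `emb_opt`, which is one way to probe whether the best embedding is mixture-dependent.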