Target speech extraction (TSE) typically relies on pre-recorded, high-quality enrollment speech, which disrupts the user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework in which the wake-word segment, captured naturally during human-machine interaction, is automatically used as the enrollment reference. This eliminates the need for pre-collected speech and enables a seamless experience. We perform the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under diverse real-world acoustic conditions. Because wake-word segments are short and noisy, we investigate enrollment augmentation using LLM-based TTS. Results show that while current TSE models suffer performance degradation in EoW-TSE, TTS-based assistance significantly improves the listening experience, though gaps remain in speech recognition accuracy.