Target speech extraction (TSE) typically relies on pre-recorded, high-quality enrollment speech, a requirement that disrupts the user experience and limits feasibility in spontaneous interaction. In this paper, we propose Enroll-on-Wakeup (EoW), a novel framework in which the wake-word segment, captured naturally during human-machine interaction, is automatically used as the enrollment reference. This eliminates the need for pre-collected speech and enables a seamless experience. We present the first systematic study of EoW-TSE, evaluating advanced discriminative and generative models under diverse real-world acoustic conditions. Because wake-word segments are short and noisy, we also investigate enrollment augmentation using LLM-based TTS. Results show that current TSE models suffer performance degradation in the EoW-TSE setting, and that TTS-based assistance significantly improves the listening experience, though gaps remain in speech recognition accuracy.