Target Sound Extraction (TSE) is the problem of separating sources of interest, indicated by a user's cue, from an input mixture. Most existing solutions operate offline and are not suited to the low-latency, causal processing constraints imposed by live-streamed applications such as augmented hearing. We introduce a family of context-aware, low-latency causal TSE models suitable for real-time processing. First, we explore the utility of context by providing the TSE model with oracle information about which sound classes make up the input mixture, while the objective of the model remains to extract one or more sources of interest indicated by the user. Since the assumptions of oracle models limit their practical applicability, we introduce a composite multi-task training objective that combines separation and classification losses. Our evaluation on single- and multi-source extraction shows the benefit of using context information, either by providing full context to the model or via the proposed multi-task training loss, which does not require full context information. Specifically, we show that our proposed model outperforms a size- and latency-matched Waveformer, a state-of-the-art model for real-time TSE.
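As an illustration of the composite multi-task objective mentioned above, the following is a minimal sketch of how a separation loss and a classification loss could be combined. The abstract does not specify the individual terms, so the choices here are assumptions: negative SNR stands in for the separation loss, multi-label binary cross-entropy over the sound classes present in the mixture stands in for the classification loss, and the weight `alpha` and all function names are hypothetical.

```python
import math


def neg_snr_db(target, estimate, eps=1e-8):
    """Negative signal-to-noise ratio in dB (assumed separation loss; lower is better)."""
    signal = sum(t * t for t in target)
    noise = sum((t - e) ** 2 for t, e in zip(target, estimate))
    return -10.0 * math.log10((signal + eps) / (noise + eps))


def binary_cross_entropy(labels, probs, eps=1e-8):
    """Mean multi-label BCE over sound classes (assumed classification loss)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


def composite_loss(target, estimate, class_labels, class_probs, alpha=0.5):
    """Separation loss plus alpha times classification loss (alpha is hypothetical)."""
    separation = neg_snr_db(target, estimate)
    classification = binary_cross_entropy(class_labels, class_probs)
    return separation + alpha * classification


# Toy example: a 4-sample target waveform and a 3-class mixture description.
target = [0.0, 1.0, 0.0, -1.0]
estimate = [0.0, 0.9, 0.1, -1.0]
class_labels = [1, 0, 1]         # which classes are present in the mixture
class_probs = [0.9, 0.2, 0.6]    # classifier outputs for those classes
loss = composite_loss(target, estimate, class_labels, class_probs)
```

In a sketch like this, the classification term only shapes training: the model is encouraged to infer which classes are in the mixture, so that no oracle context has to be supplied at inference time.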