Real-time target speaker extraction (TSE) aims to extract a desired speaker's voice from an observed mixture of multiple speakers in a streaming manner. Implementing real-time TSE is challenging because the computational complexity must be kept low enough for real-time operation. This work introduces a new architecture for Conv-TasNet-based TSE built on state space modeling (SSM), which has been shown to model long-term dependencies effectively. Owing to SSM, fewer dilated convolutional layers are required to capture temporal dependencies in Conv-TasNet, which reduces model complexity. We also enlarge the window length and shift of the convolutional (TasNet) frontend encoder to further reduce the computational cost; the resulting performance decline is compensated for by over-parameterizing the frontend encoder. The proposed method reduces the real-time factor by 78% relative to the conventional causal Conv-TasNet-based TSE while matching its performance.
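As a rough illustration of the two cost arguments in the abstract, the sketch below counts how many encoder frames a strided 1-D convolutional frontend produces for a given window and shift, and applies a relative reduction to a real-time factor (processing time divided by audio duration). The function names, the 8 kHz sampling rate, the window and shift values, and the baseline real-time factor are all assumptions chosen for the arithmetic; none of them are taken from the paper. Only the 78% reduction comes from the abstract.

```python
def num_frames(num_samples: int, window: int, shift: int) -> int:
    """Count full frames from a strided 1-D conv encoder with no padding."""
    if num_samples < window:
        return 0
    return (num_samples - window) // shift + 1


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def reduced_rtf(baseline_rtf: float, reduction: float) -> float:
    """Apply a relative reduction (0.78 means 78% lower) to a baseline RTF."""
    return baseline_rtf * (1.0 - reduction)


if __name__ == "__main__":
    one_second = 8000  # hypothetical: 1 s of audio at an assumed 8 kHz rate

    # Hypothetical short and long window/shift pairs, not the paper's settings.
    short_frames = num_frames(one_second, window=16, shift=8)
    long_frames = num_frames(one_second, window=64, shift=32)
    print("frames with short window:", short_frames)
    print("frames with long window:", long_frames)

    # Hypothetical baseline RTF; only the 78% figure is from the abstract.
    baseline = real_time_factor(processing_seconds=0.5, audio_seconds=1.0)
    print("RTF after 78% reduction:", round(reduced_rtf(baseline, 0.78), 3))
```

A larger shift means every later layer processes fewer frames per second of audio, which is why enlarging it lowers the overall cost; the abstract notes that this alone degrades performance, and that the degradation is offset by over-parameterizing the encoder.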