A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction

Current state-of-the-art automatic speech recognition (ASR) systems cannot transcribe overlapping speech audio streams separately. Consequently, when these ASR systems are used as part of a social robot like Pepper for interaction with a human, it is common practice to close the robot's microphone while the robot itself is talking. This prevents human users from interrupting the robot, which limits speech-based human-robot interaction. To enable a more natural interaction that allows for such interruptions, we propose an audio processing pipeline for filtering out the robot's ego speech using only a single-channel microphone. This pipeline takes advantage of the fact that the robot's ego speech signal, generated by a text-to-speech API, can be fed as training data into a machine learning model. The proposed pipeline combines a convolutional neural network and spectral subtraction to extract overlapping human speech from the audio recorded by the robot-embedded microphone. When evaluating on a held-out test set, we find that this pipeline outperforms our previous approach to this task, as well as state-of-the-art target speech extraction systems that were retrained on the same dataset. We have also integrated the proposed pipeline into a lightweight robot software development framework to make it available for broader use. As a step towards demonstrating the feasibility of deploying our pipeline, we use this framework to evaluate the effectiveness of the pipeline in a small lab-based feasibility pilot using the social robot Pepper. Our results show that when participants interrupt the robot, the pipeline can extract the participant's speech from one-second streaming audio buffers received by the robot-embedded single-channel microphone, and hence operates in near-real time.
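
To make the spectral subtraction step concrete, the following is a minimal sketch of magnitude spectral subtraction on a single one-second buffer. It is not the pipeline described above: the abstract does not specify how the convolutional neural network and the subtraction are combined, so the sketch simply assumes that an estimate of the ego speech magnitude spectrogram (`ego_magnitude`) is already available. The 16 kHz sample rate, the frame length of 512, the hop of 256, the spectral floor of 0.01, and all function names are illustrative assumptions.

```python
import numpy as np

FRAME_LEN = 512  # assumed frame length in samples
HOP = 256        # assumed hop size (50% overlap)


def stft(x, frame_len=FRAME_LEN, hop=HOP):
    """Short-time Fourier transform with a periodic Hann window."""
    # A periodic Hann window at 50% overlap sums to one across frames.
    window = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(frame_len) / frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop:i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1)


def istft(spec, frame_len=FRAME_LEN, hop=HOP):
    """Inverse STFT by overlap-add (no synthesis window)."""
    frames = np.fft.irfft(spec, n=frame_len, axis=1)
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out


def spectral_subtract(mixture, ego_magnitude, floor=0.01):
    """Subtract an estimated ego speech magnitude from the mixture.

    mixture:       1-D array, one buffer of microphone samples.
    ego_magnitude: estimated ego speech magnitude spectrogram with the
                   same shape as stft(mixture). Assumed to be given here;
                   how it is obtained is outside this sketch.
    floor:         fraction of the mixture magnitude kept as a lower
                   bound, so that no bin becomes negative.
    """
    spec = stft(mixture)
    magnitude, phase = np.abs(spec), np.angle(spec)
    cleaned = np.maximum(magnitude - ego_magnitude, floor * magnitude)
    # Reuse the mixture phase, as is usual for magnitude-domain subtraction.
    return istft(cleaned * np.exp(1j * phase))
```

Under these assumptions, a one-second buffer at 16 kHz (16,000 samples) would be processed with `spectral_subtract(buffer, ego_magnitude)`, where `ego_magnitude` has one row per frame and `FRAME_LEN // 2 + 1` columns. The first and last half-frames of the output are attenuated by the window, so a streaming implementation would need to carry overlap between consecutive buffers.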
