Voice-operated digital devices are rapidly becoming more available, yet the applications of voice interfaces remain restricted. For example, speaking in public places can annoy people nearby, confidential information should not be spoken aloud, and environmental noise may reduce the accuracy of speech recognition. To address these limitations, we propose a system that detects a user's unvoiced utterances. From internal information observed by an ultrasonic imaging sensor attached to the underside of the jaw, the system recognizes the content of an utterance without the user producing any voice. Our proposed deep neural network model obtains acoustic features from a sequence of ultrasound images. We confirmed that audio signals generated by our system can control existing smart speakers. We also observed that users can learn to adjust their oral movements to improve the recognition accuracy.