Speech enhancement performance degrades significantly in noisy environments, limiting the deployment of speech-controlled technologies in industrial settings such as manufacturing plants. Existing speech enhancement solutions primarily rely on advanced digital signal processing techniques, deep learning methods, or complex software optimization techniques. This paper introduces a novel enhancement strategy that incorporates a physical optimization stage by dynamically modifying the geometry of a microphone array to adapt to changing acoustic conditions. A sixteen-microphone array is mounted on a robotic arm manipulator with seven degrees of freedom, with the microphones divided into four groups of four, including one group positioned near the end-effector. The system reconfigures the array by adjusting the manipulator joint angles to place the end-effector microphones closer to the target speaker, thereby improving the reference signal quality. The proposed method integrates sound source localization, computer vision, inverse kinematics, a minimum variance distortionless response (MVDR) beamformer, and time-frequency masking using a deep neural network. Experimental results demonstrate that this approach outperforms traditional recording configurations, achieving a higher scale-invariant signal-to-distortion ratio (SI-SDR) and a lower word error rate (WER) across multiple input signal-to-noise ratio (SNR) conditions.
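For readers unfamiliar with two of the quantities named in the abstract, the following is a minimal sketch of their standard textbook definitions, not the authors' implementation. It assumes NumPy, single-channel time-domain signals for the scale-invariant signal-to-distortion ratio, and a single frequency bin with a known Hermitian noise covariance matrix and steering vector for the minimum variance distortionless response weights. The function names `si_sdr` and `mvdr_weights` and the `eps` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB for 1-D signals."""
    estimate = np.asarray(estimate, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Remove the mean so that a DC offset does not affect the score.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    distortion = estimate - target
    # eps (an assumption of this sketch) guards against division by zero.
    return 10.0 * np.log10(
        (np.sum(target ** 2) + eps) / (np.sum(distortion ** 2) + eps)
    )


def mvdr_weights(noise_cov, steering):
    """MVDR weights w = R^{-1} d / (d^H R^{-1} d) for one frequency bin.

    noise_cov: (M, M) Hermitian noise covariance matrix R.
    steering:  (M,) complex steering vector d toward the target.
    """
    # Solve R x = d instead of forming the explicit inverse of R.
    r_inv_d = np.linalg.solve(noise_cov, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)
```

The beamformer output for a bin with microphone observations `y` would then be `np.vdot(w, y)`, that is, the inner product of the conjugated weights with the observations. By construction the weights pass the steering direction with unit gain, and rescaling the estimate leaves `si_sdr` essentially unchanged.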