Sound source distance estimation (SDE) is a critical capability in human-robot interaction. An inappropriate interaction distance not only reduces the reliability of speech acquisition and understanding, but also compromises the naturalness and comfort of the interaction. Most existing SDE methods rely on microphone arrays. However, multi-microphone systems typically require careful hardware synchronization, geometric calibration, and additional space and computational resources, which limits their applicability to size-constrained and compute-limited embodied platforms. To alleviate these issues, we propose Fast-SDE, a lightweight single-microphone SDE framework suited for deployment on such platforms. Specifically, Fast-SDE employs a subband-based backbone that decomposes the frequency axis into multiple subbands, rather than processing the entire spectrum with a wide full-band backbone. A shared subband encoder then maps each subband to a compact latent representation and learns the relationship between acoustic structure and time-frequency patterns. Finally, a lightweight regression head converts the fused subband representations into the estimated distance. Extensive simulation and real-world experiments demonstrate the merits of the proposed method. To benefit the broader research community, we have open-sourced our code at https://github.com/JiangWAV/FAST-SDE.
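The three stages described in the abstract (subband decomposition, a shared subband encoder, and a lightweight regression head) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the equal-width subband split, the time-averaged linear encoder, fusion by concatenation, the softplus output, and the random untrained weights are all assumptions chosen only to show the data flow and tensor shapes.

```python
import numpy as np


def split_subbands(spec, num_subbands):
    """Split a (freq, time) magnitude spectrogram into equal-width subbands.

    Assumption: equal-width, non-overlapping subbands; leftover high-frequency
    bins that do not fill a whole subband are dropped.
    """
    n_freq, n_time = spec.shape
    width = n_freq // num_subbands
    trimmed = spec[: width * num_subbands]
    return trimmed.reshape(num_subbands, width, n_time)


def shared_encoder(subbands, w_enc, b_enc):
    """Map every subband to a compact latent vector with the SAME weights.

    Assumption: mean pooling over time followed by one linear layer and ReLU.
    """
    pooled = subbands.mean(axis=2)                   # (num_subbands, width)
    return np.maximum(pooled @ w_enc + b_enc, 0.0)   # (num_subbands, latent_dim)


def regression_head(latents, w_out, b_out):
    """Fuse subband latents and regress one positive scalar distance.

    Assumption: fusion by concatenation, softplus to keep the output positive.
    """
    fused = latents.reshape(-1)                      # (num_subbands * latent_dim,)
    return float(np.logaddexp(0.0, fused @ w_out + b_out))


# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
num_subbands, latent_dim = 8, 16
spec = np.abs(rng.standard_normal((257, 100)))       # stand-in for an STFT magnitude

bands = split_subbands(spec, num_subbands)
width = bands.shape[1]

# Random, untrained parameters; a real model would learn these from data.
w_enc = rng.standard_normal((width, latent_dim)) * 0.1
b_enc = np.zeros(latent_dim)
w_out = rng.standard_normal(num_subbands * latent_dim) * 0.1
b_out = 0.0

latents = shared_encoder(bands, w_enc, b_enc)
distance = regression_head(latents, w_out, b_out)
print(bands.shape, latents.shape, distance)
```

Because the encoder weights are shared, the encoder's parameter count depends on the subband width rather than on the full spectrum width, which is one plausible reading of why a subband backbone can be lighter than a wide full-band one.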