This study addresses the application of deep learning techniques in networks that jointly classify and localize sound signals. Current state-of-the-art deep learning networks for sound source localization (SSL) lack feature aggregation within their architecture. Feature aggregation enhances model performance by consolidating information from different feature scales, thereby improving feature robustness and invariance. This is particularly important in SSL networks, which must differentiate direct and indirect acoustic signals. To address this gap, we adapt feature aggregation techniques from computer vision neural networks to signal detection neural networks. Additionally, we propose the Scale Encoding Network (SEN), a feature aggregator that encodes features from various scales and compresses the network for more computationally efficient aggregation. To evaluate the efficacy of feature aggregation in SSL networks, we integrated the following computer vision feature aggregation sub-architectures into an SSL control architecture: Path Aggregation Network (PANet), Weighted Bi-directional Feature Pyramid Network (BiFPN), and SEN. PANet and BiFPN are established aggregators in computer vision models, while the proposed SEN is a more compact aggregator. These sub-architectures were evaluated using two metrics for signal classification and two metrics for direction-of-arrival regression. The results suggest that models incorporating feature aggregation outperformed the control model, the Sound Event Localization and Detection network (SELDnet), in both sound signal classification and localization. Feature aggregation techniques enhance the performance of sound detection neural networks, particularly in direction-of-arrival regression.
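To make the idea of consolidating information from different feature scales concrete, the following is a minimal sketch of weighted multi-scale fusion, using the fast normalized fusion rule commonly associated with BiFPN. The abstract does not state which fusion variant the integrated sub-architectures use, so this is an illustration under assumptions rather than the authors' implementation. The function names `upsample_nearest` and `fast_normalized_fusion` are hypothetical, and plain 1-D lists stand in for feature maps.

```python
from typing import List, Sequence


def upsample_nearest(feature: Sequence[float], factor: int) -> List[float]:
    """Repeat each value `factor` times so a coarse scale matches a finer one."""
    return [value for value in feature for _ in range(factor)]


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    weights: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Fuse same-length feature vectors as sum(w_i * f_i) / (eps + sum(w_i)).

    Assumption: the weights would be learned in a real network; here they are
    passed in as plain numbers for illustration.
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature vector")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("resample all feature vectors to one length first")
    # ReLU clamps each weight to be non-negative so the normalization is stable.
    clamped = [max(0.0, w) for w in weights]
    denom = eps + sum(clamped)
    return [
        sum(w * f[i] for w, f in zip(clamped, features)) / denom
        for i in range(length)
    ]


# Usage: fuse a fine-scale feature with an upsampled coarse-scale feature.
fine = [1.0, 2.0, 3.0, 4.0]
coarse = [10.0, 20.0]
fused = fast_normalized_fusion([fine, upsample_nearest(coarse, 2)], [1.0, 1.0])
```

With equal weights the fused vector is close to the element-wise average of the two scales. A weight that is zero or negative removes the corresponding scale from the fusion entirely.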