This study examines the application of deep learning techniques to joint sound signal classification and localization networks. Current state-of-the-art deep learning networks for sound source localization (SSL) lack feature aggregation within their architecture. Feature aggregation enhances model performance by consolidating information from different feature scales, thereby improving feature robustness and invariance. This is particularly important in SSL networks, which must differentiate direct and indirect acoustic signals. To address this gap, we adapt feature aggregation techniques from computer vision neural networks to signal detection neural networks. Additionally, we propose the Scale Encoding Network (SEN), a feature aggregator that encodes features from various scales and compresses the network for more computationally efficient aggregation. To evaluate the efficacy of feature aggregation in SSL networks, we integrated the following feature aggregation sub-architectures into an SSL control architecture: Path Aggregation Network (PANet), Weighted Bi-directional Feature Pyramid Network (BiFPN), and SEN. These sub-architectures were evaluated using two metrics for signal classification and two metrics for direction-of-arrival regression. PANet and BiFPN are established aggregators in computer vision models, while the proposed SEN is a more compact aggregator. The results suggest that models incorporating feature aggregation outperform the control model, the Sound Event Localization and Detection network (SELDnet), in both sound signal classification and localization. Feature aggregation techniques enhance the performance of sound detection neural networks, particularly in direction-of-arrival regression.
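As a rough illustration of what aggregating features from different scales can look like, the sketch below fuses a fine-resolution and a coarse-resolution feature map with non-negative weights normalized by their sum, in the style of the fast normalized fusion used in BiFPN. It is not the SEN architecture or the authors' implementation: the function names (`upsample_time`, `weighted_fusion`, `aggregate`), the array shapes, the nearest-neighbour upsampling along a time axis, and the fixed weights are all assumptions made for the example. In a trained network the weights would be learned.

```python
import numpy as np


def upsample_time(x, factor):
    """Nearest-neighbour upsampling along the time axis (axis 0)."""
    return np.repeat(x, factor, axis=0)


def weighted_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature maps as sum_i(w_i * f_i) / (eps + sum_j(w_j)).

    Weights are clipped at zero (ReLU) so each input contributes
    non-negatively; eps avoids division by zero.
    """
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    fused = sum(wi * f for wi, f in zip(w, features))
    return fused / (eps + w.sum())


def aggregate(fine, coarse, weights=(1.0, 1.0)):
    """Bring the coarse map to the fine map's time resolution, then fuse.

    Assumes (hypothetically) that both maps have shape (time, channels) and
    that the fine time length is an integer multiple of the coarse one.
    """
    factor = fine.shape[0] // coarse.shape[0]
    return weighted_fusion([fine, upsample_time(coarse, factor)], weights)


if __name__ == "__main__":
    fine = np.ones((8, 4))          # hypothetical fine-scale features
    coarse = 3.0 * np.ones((4, 4))  # hypothetical coarse-scale features
    print(aggregate(fine, coarse).shape)
```

With equal weights the fused map is approximately the average of the two inputs. Setting one weight to zero or below removes that scale from the result, which is how such a layer could learn to favour the more informative scale.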