This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram; interaural level difference, ILD) and phase-based features (phase spectrogram; interaural phase difference, IPD). Evaluations on in-domain data and on out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often improve performance more than increasing model complexity does. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs that combine channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.
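To make the four feature types named above concrete, the following is a minimal sketch of how they might be computed from a two-channel signal and stacked into CNN input channels. It is not the authors' pipeline: the STFT parameters (`n_fft`, `hop`), the dB-scaled ILD, the wrapped-phase IPD, and the helper names `stft`, `binaural_features`, and `stack_features` are all assumptions chosen for illustration.

```python
import numpy as np

def stft(x, n_fft=512, hop=256):
    """Hann-windowed STFT; returns a complex array of shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T

def binaural_features(left, right, n_fft=512, hop=256, eps=1e-8):
    """Compute amplitude- and phase-based features (assumed definitions)."""
    L = stft(left, n_fft, hop)
    R = stft(right, n_fft, hop)
    mag_l, mag_r = np.abs(L), np.abs(R)
    # ILD in dB: level ratio between the left and right channels.
    ild = 20.0 * np.log10((mag_l + eps) / (mag_r + eps))
    # IPD in radians, wrapped to (-pi, pi]: phase of the cross-spectrum.
    ipd = np.angle(L * np.conj(R))
    return {
        "mag": np.stack([mag_l, mag_r]),                  # 2 channels
        "phase": np.stack([np.angle(L), np.angle(R)]),    # 2 channels
        "ild": ild[np.newaxis],                           # 1 channel
        "ipd": ipd[np.newaxis],                           # 1 channel
    }

def stack_features(feats, names):
    """Concatenate the selected features along the channel axis."""
    return np.concatenate([feats[n] for n in names], axis=0)

# Synthetic example: the right ear hears an attenuated, delayed copy of the left.
rng = np.random.default_rng(0)
src = rng.standard_normal(16000)
left, right = src, 0.5 * np.roll(src, 8)

feats = binaural_features(left, right)
x_compact = stack_features(feats, ["ild", "ipd"])         # two-feature set
x_rich = stack_features(feats, ["mag", "ild", "ipd"])     # spectrograms + ILD + IPD
print(x_compact.shape, x_rich.shape)
```

Under these assumptions, the compact ILD + IPD input has two channels, while the richer input that adds both channel magnitude spectrograms has four; either array could be fed to a 2-D CNN as a (channels, frequency, time) tensor.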