Recent spatial self-supervised audio models achieve high performance on localization tasks, raising the question of whether they encode microsecond-scale interaural phase fine structure. We propose a psychoacoustic benchmark based on the binaural masking level difference (BMLD) to evaluate this. Using an equalization-cancellation baseline and a GCC-PHAT positive control, we evaluate nine frozen audio models spanning binaural self-supervised learning (SSL) models, monaural SSL models, and neural audio codecs. Four monaural negative controls yield zero BMLD, confirming that the measure is specific to binaural processing. Two general-purpose binaural SSL models exhibit minimal phase sensitivity, whereas dedicated binaural spatial SSL models achieve a BMLD comparable to the analytical baseline. Progressive physical ablations show that the general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation. High detection rates on speech reflect a confounding reliance on broadband envelope cues rather than genuine phase encoding.
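For readers unfamiliar with the BMLD, the following is a minimal sketch of the classic stimulus pair and an idealised cancellation step. It is an illustration under stated assumptions, not the benchmark's implementation: the function names (`make_stimulus`, `ec_residual_db`, `bmld_db`), the 500 Hz tone, the 16 kHz sample rate, and the 0.3 s duration are all hypothetical choices, and the cancellation step omits the internal noise that a full equalization-cancellation model would include.

```python
import numpy as np

def make_stimulus(condition, snr_db, fs=16000, dur=0.3, f0=500.0, seed=0):
    """Return a (2, n) array: diotic noise plus a tone.

    condition "N0S0":  tone identical in both ears.
    condition "N0Spi": tone phase-inverted in the right ear.
    All parameter defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = int(fs * dur)
    t = np.arange(n) / fs
    noise = rng.standard_normal(n)
    noise /= np.sqrt(np.mean(noise ** 2))            # unit-RMS masker
    tone = np.sin(2 * np.pi * f0 * t)
    tone *= 10 ** (snr_db / 20) / np.sqrt(np.mean(tone ** 2))
    sign = {"N0S0": 1.0, "N0Spi": -1.0}[condition]
    left = noise + tone
    right = noise + sign * tone
    return np.stack([left, right])

def ec_residual_db(stereo, floor=1e-12):
    """Idealised cancellation: power of (left - right) in dB.

    The masker is diotic here, so no equalization is needed, and no
    internal noise is modelled (a simplifying assumption).
    """
    diff = stereo[0] - stereo[1]
    return 10 * np.log10(np.mean(diff ** 2) + floor)

def bmld_db(threshold_n0s0_db, threshold_n0spi_db):
    """BMLD: how much lower the detection threshold is for N0Spi than N0S0."""
    return threshold_n0s0_db - threshold_n0spi_db

if __name__ == "__main__":
    for cond in ("N0S0", "N0Spi"):
        x = make_stimulus(cond, snr_db=-20.0)
        print(cond, round(ec_residual_db(x), 2), "dB residual")
```

In this idealised setting, subtracting the channels removes the masker entirely in both conditions, but only the N0Sπ condition leaves a tone residual, which is the cue a phase-sensitive system can exploit. A model that never compares fine structure across channels has no access to this residual, which is consistent with the zero BMLD reported for the monaural controls.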