Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrained audio models. SARL probes source-level factors (azimuth, elevation, distance, class) and room-level factors (reverberation time RT60, volume, shape). Experiments across diverse encoders reveal three patterns: input configuration and training paradigm shape spatial encoding; source factors are consistently easier to decode than room factors; and sensitivity analysis under controlled perturbations shows heterogeneous responses to source and room variation. These results point to systematic biases in current pretrained audio representations. SARL is released as an open-source benchmark for reproducible evaluation of spatial audio representations.
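As an illustration of what probing a frozen representation for a source-level factor can look like, the following is a minimal sketch, not the SARL implementation. It assumes a linear (ridge) probe, a sin/cos encoding of azimuth to handle angular wrap-around, and synthetic embeddings standing in for the output of a pretrained encoder. All function names and parameters here are hypothetical.

```python
import numpy as np

def fit_linear_probe(X, Y, l2=1e-3):
    """Closed-form ridge regression from frozen embeddings X to targets Y."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ Y)

def predict_azimuth(W, X):
    """Map embeddings to (sin, cos) predictions, then back to an angle."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    P = Xb @ W  # columns are [sin, cos]
    return np.arctan2(P[:, 0], P[:, 1])

def angular_error_deg(pred, true):
    """Mean absolute angular error in degrees, with wrap-around."""
    diff = np.angle(np.exp(1j * (pred - true)))  # wrapped to (-pi, pi]
    return np.degrees(np.abs(diff)).mean()

# Synthetic stand-in for encoder embeddings (assumption): a noisy linear
# mix of sin/cos of the true azimuth. A real study would use embeddings
# extracted from a frozen pretrained audio model instead.
rng = np.random.default_rng(0)
n, d = 2000, 32
azimuth = rng.uniform(-np.pi, np.pi, size=n)
targets = np.column_stack([np.sin(azimuth), np.cos(azimuth)])
mix = rng.normal(size=(2, d))
X = targets @ mix + 0.1 * rng.normal(size=(n, d))

# Fit on a training split, evaluate on held-out examples.
train, test = slice(0, 1500), slice(1500, None)
W = fit_linear_probe(X[train], targets[train])
err = angular_error_deg(predict_azimuth(W, X[test]), azimuth[test])
print(f"held-out azimuth error: {err:.2f} degrees")
```

The same pattern would extend to room-level factors such as RT60 by swapping the target; a low held-out error suggests the factor is linearly decodable from the representation, while a high error does not by itself prove the information is absent.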