Probing heads map the representations that a machine learning model learns from audio to downstream task labels, and they are a key component in evaluating representation learning. Most bioacoustic benchmarks use a fixed, low-capacity probe, such as a linear layer on the final encoder layer. While this standardization enables model comparisons, it may bias results by overlooking the interaction between encoder features and probe design. In this work, we systematically study probing strategies on two bioacoustic benchmarks, BEANS and BirdSet. We evaluate last-layer and multi-layer probing with both linear and attention probes. We show that larger probe heads that leverage time information perform better. Multi-layer probing improves downstream task performance across all tested models, and attention probing outperforms linear probing for transformer models. Our results suggest that current benchmarks may misrepresent encoder quality when they rely on a last-layer probing setup.
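To make the two ends of this design space concrete, the following is a minimal NumPy sketch of a last-layer linear probe and a multi-layer attention probe. It is an illustration under stated assumptions, not the implementation evaluated in this work: the function names, the mean pooling over time in the linear probe, the softmax-weighted layer mixing, and the single-query attention pooling are all hypothetical choices, and the parameters are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def linear_probe_last_layer(layer_feats, W, b):
    """Last-layer linear probe.

    layer_feats: (L, T, D) array of frozen encoder outputs
                 (L layers, T time frames, D feature dims).
    Assumption: time is collapsed by mean pooling before the linear layer.
    """
    pooled = layer_feats[-1].mean(axis=0)  # (D,)
    return pooled @ W + b                  # (C,) class logits


def attention_probe_multi_layer(layer_feats, layer_logits, query, W, b):
    """Multi-layer attention probe (hypothetical single-query variant).

    layer_logits: (L,) learnable scores, softmaxed into layer weights.
    query:        (D,) learnable vector that scores each time frame.
    """
    layer_w = softmax(layer_logits)                     # (L,)
    mixed = np.tensordot(layer_w, layer_feats, axes=1)  # (T, D) weighted layer mix
    scores = mixed @ query / np.sqrt(mixed.shape[-1])   # (T,) per-frame scores
    time_w = softmax(scores)                            # (T,) attention over time
    pooled = time_w @ mixed                             # (D,) attention-pooled feature
    return pooled @ W + b                               # (C,) class logits


# Toy example with random "encoder" features and untrained probe parameters.
rng = np.random.default_rng(0)
L, T, D, C = 4, 50, 16, 3
feats = rng.normal(size=(L, T, D))
W = rng.normal(size=(D, C))
b = np.zeros(C)

logits_linear = linear_probe_last_layer(feats, W, b)
logits_attn = attention_probe_multi_layer(
    feats, np.zeros(L), rng.normal(size=D), W, b
)
```

In this sketch the attention probe contains the linear probe as a special case: with a zero query the time weights become uniform (mean pooling), and with layer scores that put all weight on the final layer only the last-layer features are used. The extra parameters let the probe choose which layers and which time frames to read from.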