The ability of a machine learning model to cope with differences between training and deployment conditions (e.g. distribution shift, or generalization to entirely new classes) is crucial for real-world use cases. However, most empirical work in this area has focused on the image domain, with artificial benchmarks constructed to measure individual aspects of generalization. We present BIRB, a complex benchmark centered on the retrieval of bird vocalizations from passively recorded datasets, given focal recordings from a large citizen science corpus available for training. We propose a baseline system for this collection of tasks using representation learning and a nearest-centroid search. Our thorough empirical evaluation and analysis surface open research directions, suggesting that BIRB fills the need for a more realistic and complex benchmark to drive progress on robustness to distribution shift and generalization of ML models.
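As a rough illustration of what a nearest-centroid search over learned representations can look like, the sketch below averages a few exemplar embeddings for one class and ranks candidate embeddings by similarity to that centroid. It is a minimal sketch under stated assumptions, not the baseline system itself: the function names, the toy two-dimensional embeddings, and the choice of cosine similarity are all hypothetical and are not specified in the abstract.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_centroid(exemplars, candidates):
    """Return candidate indices ordered from most to least similar
    to the centroid of the exemplar embeddings."""
    c = centroid(exemplars)
    scores = [cosine(c, x) for x in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy data (assumption): 2-D vectors standing in for learned audio embeddings.
exemplars = [[1.0, 0.0], [0.8, 0.2]]                 # focal recordings of one species
candidates = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]    # passively recorded segments
ranking = rank_by_centroid(exemplars, candidates)
print(ranking)
```

In a retrieval setting of this kind, the embeddings would come from a model trained on the focal recordings, and the resulting ranking would be scored with a retrieval metric; both of those pieces are outside this sketch.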