Positive-Unlabelled Active Learning to Curate a Dataset for Orca Resident Interpretation

This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, which also covers other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search uses a weakly-supervised, positive-unlabelled, active learning strategy to identify all instances of marine mammals. The resulting transformer-based detectors outperform state-of-the-art detectors on the DEEPAL, DCLDE-2026, and two newly introduced expert-annotated datasets in terms of accuracy, energy efficiency, and speed. The detection model has a specificity of 0-28.8% at 95% sensitivity. On the DCLDE-2026 dataset, our multiclass species classifier obtains a top-1 accuracy of 42.1% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 43.0% (4 train classes, 5 test classes). The search yields 919 hours of SRKW data, 230 hours of Bigg's orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of Pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under a CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.
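The detector metric quoted above, specificity at 95% sensitivity, can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `specificity_at_sensitivity`, the toy scores, and the convention that a clip is called positive when its score is at or above the threshold are all assumptions made for illustration.

```python
def specificity_at_sensitivity(scores, labels, target_sensitivity=0.95):
    """Return (threshold, specificity) at the highest threshold that
    still reaches the target sensitivity.

    scores: detector outputs, higher means "more likely a marine mammal".
    labels: 1 for clips with a call, 0 for clips without (hypothetical data).
    A clip is predicted positive when score >= threshold (assumed convention).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative clip")

    # Walk the candidate thresholds from strictest to most permissive.
    # The first one that reaches the target sensitivity rejects the most
    # negatives, so it gives the best specificity at that operating point.
    for threshold in sorted(set(scores), reverse=True):
        sensitivity = sum(s >= threshold for s in positives) / len(positives)
        if sensitivity >= target_sensitivity:
            specificity = sum(s < threshold for s in negatives) / len(negatives)
            return threshold, specificity

    # Not reachable: the lowest score as threshold gives sensitivity 1.0.
    raise RuntimeError("no threshold reached the target sensitivity")


# Toy example with made-up scores: 4 positive clips, 4 negative clips.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
threshold, specificity = specificity_at_sensitivity(toy_scores, toy_labels)
print(f"threshold={threshold}, specificity={specificity:.2f}")
```

With only four positive clips, reaching 95% sensitivity means catching all four, so the threshold drops to the lowest positive score and one negative clip above it is accepted as a false alarm. A low specificity at this operating point therefore means that most clips without calls still pass the detector.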

