Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls requires two separate annotation efforts: binary presence labels for classification and precise temporal boundaries for localization. A binary label for a multi-minute recording can be assigned in seconds, but timestamping every call within it requires hours of expert effort. Providing both is infeasible at operational scale. We present DSMIL-LocNet, a weakly supervised multiple instance learning (MIL) framework that performs both classification and temporal localization using only recording-level presence/absence labels. Our dual-stream architecture integrates spectral and temporal features to process recordings of 2--30 minutes without the temporal compression that degrades existing CNN methods on long inputs. On the AcousticTrends BlueFinLibrary, DSMIL-LocNet achieves F1 scores of 0.88--0.91 on recordings of 300--1800 s, where fully supervised CNN baselines degrade to 0.19--0.64. It also provides temporal localization that these baselines cannot produce without frame-level annotation. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc