The recent widespread adoption of drones for studying marine animals provides opportunities for deriving biological information from aerial imagery. The large volumes of imagery acquired from drones are well suited for machine learning (ML) analysis. Development of ML models for analyzing aerial imagery of marine animals has followed the classical paradigm of training, testing, and deploying a new model for each dataset, requiring significant time, human effort, and ML expertise. We introduce Frame Level ALIgnment and tRacking (FLAIR), which leverages the video understanding of Segment Anything Model 2 (SAM2) and the vision-language capabilities of Contrastive Language-Image Pre-training (CLIP). FLAIR takes a drone video as input and outputs segmentation masks of the species of interest across the video. Notably, FLAIR leverages a zero-shot approach, eliminating the need for labeled data, training a new model, or fine-tuning an existing model to generalize to other species. With a dataset of 18,000 drone images of Pacific nurse sharks, we trained state-of-the-art object detection models to compare against FLAIR. We show that FLAIR massively outperforms these object detectors and performs competitively against two human-in-the-loop methods for prompting SAM2, achieving a Dice score of 0.81. FLAIR readily generalizes to other shark species without additional human effort and can be combined with novel heuristics to automatically extract relevant information, including body length and tailbeat frequency. FLAIR has significant potential to accelerate aerial imagery analysis workflows, requiring markedly less human effort and expertise than traditional ML workflows while achieving superior accuracy. By reducing the effort required for aerial imagery analysis, FLAIR allows scientists to spend more time interpreting results and deriving insights about marine ecosystems.
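The Dice score reported above measures overlap between a predicted segmentation mask and a reference mask. As a minimal illustrative sketch (not the paper's evaluation code), the metric can be computed on binary masks as follows; the toy masks and the empty-mask convention are assumptions for illustration:

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice coefficient between two binary masks: 2*|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # both masks empty: treat as perfect agreement (a common convention)
    return 2.0 * np.logical_and(pred, gt).sum() / denom

# Toy 4x4 masks: each mask covers 3 pixels, 2 of which overlap
pred = np.zeros((4, 4), dtype=bool); pred[0, 0:3] = True
gt = np.zeros((4, 4), dtype=bool);  gt[0, 1:4] = True
print(round(dice_score(pred, gt), 2))  # 2*2 / (3+3) ≈ 0.67
```

A score of 1.0 indicates pixel-perfect agreement; the 0.81 reported for FLAIR indicates that its masks overlap substantially with human-derived references.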