Monocular RGB cameras mounted on drones are widely used for wildlife monitoring, yet most analytical pipelines remain confined to two-dimensional image space, leaving geometric information in video underexploited. We present WildLIFT, a computational framework that integrates three-dimensional scene geometry from monocular drone video with open-vocabulary 2D instance segmentation to enable species-agnostic 3D detection and tracking. Oriented 3D bounding box labels with semantic face information enable quantitative assessment of viewpoint coverage and inter-animal occlusion, producing structured metadata for downstream ecological analyses. We validate the framework on 2,581 manually curated frames comprising over 6,700 3D detections across four large mammal species. WildLIFT maintains high identity consistency in multi-animal scenes and substantially reduces manual 3D annotation effort through keyframe-based refinement. By transforming standard drone footage into structured 3D and viewpoint-aware representations, WildLIFT extends the analytical utility of aerial wildlife datasets for behavioural research and population monitoring.
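To make the idea of viewpoint coverage from oriented 3D boxes with semantic faces concrete, the following is a minimal sketch and not the WildLIFT implementation. It assumes a box described by a centre, a size (length, width, height), and a yaw angle about the vertical axis, and it tests which named faces point towards a given camera position. The face names, the axis convention (+x towards the head, +y towards the animal's left, +z up), and the functions `yaw_rotation` and `visible_faces` are all hypothetical choices made for illustration.

```python
import numpy as np

# Assumed convention (not from the paper): in the box frame, +x points
# towards the animal's head, +y towards its left side, and +z upwards.
FACE_NORMALS = {
    "front": (1, 0, 0), "back": (-1, 0, 0),
    "left": (0, 1, 0), "right": (0, -1, 0),
    "top": (0, 0, 1), "bottom": (0, 0, -1),
}


def yaw_rotation(yaw):
    """Rotation matrix for a rotation of `yaw` radians about the z axis."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def visible_faces(center, size, yaw, camera):
    """Return the names of box faces whose outward normal faces the camera.

    A face counts as visible when the vector from the face centre to the
    camera has a positive dot product with the face's outward normal.
    Occlusion by other objects is ignored in this sketch.
    """
    center = np.asarray(center, dtype=float)
    half = np.asarray(size, dtype=float) / 2.0
    camera = np.asarray(camera, dtype=float)
    rot = yaw_rotation(yaw)

    visible = []
    for name, normal in FACE_NORMALS.items():
        n_local = np.asarray(normal, dtype=float)
        n_world = rot @ n_local                       # normal in world frame
        face_center = center + rot @ (n_local * half)  # centre of this face
        if np.dot(n_world, camera - face_center) > 1e-9:
            visible.append(name)
    return visible


# Hypothetical example: a 2 m x 1 m x 2 m box seen from a drone ahead and above.
faces = visible_faces(center=(0, 0, 1), size=(2, 1, 2), yaw=0.0,
                      camera=(10, 0, 30))
print(faces)
```

Accumulating the returned face names over the frames of a track would give a simple per-animal record of which sides have been observed. A fuller treatment would also need to account for occlusion between animals, which this sketch leaves out.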