Self-Supervised Animal Identification for Long Videos

Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited to long sequences because of memory constraints and temporal error propagation. We introduce an efficient self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video (a common scenario in practice) and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, achieving state-of-the-art accuracy ($>$97\%) while consuming less than 1 GB of GPU memory per batch, an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available \href{https://huggingface.co/datasets/tonyFang04/8-calves}{here}.
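
To make the in-batch pseudo-label assignment concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It takes toy feature vectors for the same K individuals in two sampled frames, finds the one-to-one matching with the highest total cosine similarity, and scores all K x K pairs with a sigmoid-style binary cross-entropy. The function names, the toy features, and the `scale` and `bias` values are hypothetical. The matching is done by exhaustive search over permutations, which returns the same optimal assignment as the Hungarian algorithm for small K but does not scale; in practice a solver such as `scipy.optimize.linear_sum_assignment` would take its place.

```python
import itertools
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_identities(feats_a, feats_b):
    """Match each detection in frame A to one detection in frame B.

    Returns (perm, sim), where perm[i] is the index in feats_b assigned to
    feats_a[i] and sim is the K x K cosine-similarity matrix.
    Exhaustive search stands in for the Hungarian algorithm here.
    """
    k = len(feats_a)
    sim = [[cosine_similarity(a, b) for b in feats_b] for a in feats_a]
    best = max(
        itertools.permutations(range(k)),
        key=lambda p: sum(sim[i][p[i]] for i in range(k)),
    )
    return list(best), sim


def pairwise_bce_loss(sim, perm, scale=10.0, bias=-5.0):
    """Mean binary cross-entropy over all K x K pairs.

    Matched pairs (perm[i] == j) get target 1, all other pairs target 0.
    scale and bias are hypothetical values that turn similarities into logits.
    """
    k = len(sim)
    total = 0.0
    for i in range(k):
        for j in range(k):
            logit = scale * sim[i][j] + bias
            target = 1.0 if perm[i] == j else 0.0
            # Numerically stable BCE with logits.
            total += max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
    return total / (k * k)


# Toy features: three individuals, listed in a different order in frame B.
feats_a = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
feats_b = [[0.0, 0.9, 0.2], [0.2, 0.1, 0.9], [0.9, 0.1, 0.0]]

perm, sim = match_identities(feats_a, feats_b)
loss = pairwise_bce_loss(sim, perm)
print("assignment:", perm)
print("loss:", round(loss, 4))
```

In a training loop the matched pairs would act as pseudo-labels for the current batch, so the loss can be computed without identity annotations; the sketch omits the frozen backbone and any trainable projection.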
