Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval

Large-scale biodiversity monitoring platforms increasingly rely on multimodal wildlife observations. While recent foundation models enable rich semantic representations across vision, audio, and language, retrieving relevant observations from massive archives remains challenging due to the computational cost of high-dimensional similarity search. In this work, we introduce compact hypercube embeddings for fast text-based wildlife observation retrieval, a framework that enables efficient text-based search over large-scale wildlife image and audio databases using compact binary representations. Building on the cross-view code alignment hashing framework, we extend lightweight hashing beyond a single-modality setup to align natural language descriptions with visual or acoustic observations in a shared Hamming space. Our approach leverages pretrained wildlife foundation models, including BioCLIP and BioLingual, and adapts them efficiently for hashing using parameter-efficient fine-tuning. We evaluate our method on large-scale benchmarks, including iNaturalist2024 for text-to-image retrieval and iNatSounds2024 for text-to-audio retrieval, as well as multiple soundscape datasets to assess robustness under domain shift. Results show that retrieval using discrete hypercube embeddings achieves competitive, and in several cases superior, performance compared to continuous embeddings, while drastically reducing memory and search cost. Moreover, we observe that the hashing objective consistently improves the underlying encoder representations, leading to stronger retrieval and zero-shot generalization. These results demonstrate that binary, language-based retrieval enables scalable and efficient search over large wildlife archives for biodiversity monitoring systems.
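To make the retrieval step concrete, the following is a minimal sketch of searching in a shared Hamming space, assuming text and observation embeddings have already been produced by some encoder. It illustrates only the generic binarize-and-compare step. It is not the cross-view code alignment training procedure or the fine-tuned BioCLIP/BioLingual models described above. The sign-threshold binarization and the helper names `binarize`, `hamming_distances`, and `search` are hypothetical choices for illustration.

```python
import numpy as np

# Lookup table: number of set bits for every possible byte value.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def binarize(embeddings):
    """Threshold real-valued embeddings at zero and pack 8 bits per byte.

    Assumption: codes come from a simple sign threshold; the paper's
    learned hashing head may differ.
    """
    bits = (np.asarray(embeddings) > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)


def hamming_distances(query_code, db_codes):
    """Hamming distance between one packed query code and N packed codes."""
    xor = np.bitwise_xor(db_codes, query_code)  # broadcasts (N, B) with (B,)
    return _POPCOUNT[xor].sum(axis=1, dtype=np.int64)


def search(query_code, db_codes, k=5):
    """Return indices and distances of the k closest database codes."""
    dist = hamming_distances(query_code, db_codes)
    order = np.argsort(dist, kind="stable")[:k]
    return order, dist[order]


if __name__ == "__main__":
    # Toy stand-ins for observation embeddings and a text-query embedding.
    db = np.array([
        [1, -1, 1, -1, 1, -1, 1, -1],
        [1, 1, 1, 1, 1, 1, 1, 1],
        [-1, -1, -1, -1, -1, -1, -1, 1],
    ], dtype=np.float32)
    query = np.array([1, 1, 1, 1, 1, 1, 1, -1], dtype=np.float32)

    idx, dist = search(binarize(query), binarize(db), k=3)
    print(idx, dist)
```

In this sketch each 8-dimensional embedding is stored in a single byte, and comparison reduces to an XOR followed by a bit count, which is where the memory and search savings of binary codes come from.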
