Compact Hypercube Embeddings for Fast Text-based Wildlife Observation Retrieval

Large-scale biodiversity monitoring platforms increasingly rely on multimodal wildlife observations. While recent foundation models enable rich semantic representations across vision, audio, and language, retrieving relevant observations from massive archives remains challenging due to the computational cost of high-dimensional similarity search. In this work, we introduce compact hypercube embeddings for fast text-based wildlife observation retrieval, a framework that enables efficient text-based search over large-scale wildlife image and audio databases using compact binary representations. Building on the cross-view code alignment hashing framework, we extend lightweight hashing beyond a single-modality setup to align natural language descriptions with visual or acoustic observations in a shared Hamming space. Our approach leverages pretrained wildlife foundation models, including BioCLIP and BioLingual, and adapts them efficiently for hashing using parameter-efficient fine-tuning. We evaluate our method on large-scale benchmarks, including iNaturalist2024 for text-to-image retrieval and iNatSounds2024 for text-to-audio retrieval, as well as multiple soundscape datasets to assess robustness under domain shift. Results show that retrieval using discrete hypercube embeddings achieves competitive, and in several cases superior, performance compared to continuous embeddings, while drastically reducing memory and search cost. Moreover, we observe that the hashing objective consistently improves the underlying encoder representations, leading to stronger retrieval and zero-shot generalization. These results demonstrate that binary, language-based retrieval enables scalable and efficient search over large wildlife archives for biodiversity monitoring systems.
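As a rough illustration of why retrieval in a shared Hamming space is cheap, the sketch below binarizes real-valued embeddings by sign, packs them into bytes, and ranks database items by Hamming distance to a query code. It is not the paper's method: the random vectors stand in for encoder outputs (no BioCLIP or BioLingual model is loaded), and the 64-bit code length, the sign-threshold binarization, and the helper names `binarize` and `hamming_search` are all assumptions made for the example.

```python
import numpy as np

def binarize(embeddings):
    """Threshold each dimension at zero and pack the bits, 8 per byte."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

# Lookup table: number of set bits in every possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def hamming_search(query_code, db_codes, k=5):
    """Return indices and Hamming distances of the k codes nearest to the query."""
    xor = np.bitwise_xor(db_codes, query_code)         # differing bits, per byte
    dists = POPCOUNT[xor].sum(axis=1, dtype=np.int64)  # popcount summed per item
    order = np.argsort(dists, kind="stable")[:k]
    return order, dists[order]

# Stand-in data (assumption): random vectors play the role of encoder outputs.
rng = np.random.default_rng(0)
n_bits = 64
db_embeddings = rng.standard_normal((1000, n_bits))

# Hypothetical text query: a slightly perturbed copy of database item 42,
# mimicking a text embedding aligned with its matching observation.
query_embedding = db_embeddings[42] + 0.1 * rng.standard_normal(n_bits)

db_codes = binarize(db_embeddings)                  # shape (1000, 8), dtype uint8
query_code = binarize(query_embedding[None, :])[0]  # shape (8,)
top_idx, top_dist = hamming_search(query_code, db_codes, k=5)
```

With these illustrative sizes, each item occupies 8 bytes as a packed code, versus 256 bytes for the same 64 dimensions stored as float32, and the distance computation reduces to XOR plus bit counting.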
