Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers

Building on existing approaches, we revisit Human-in-the-Loop Object Retrieval, a task that consists of iteratively retrieving images containing objects of a class of interest, specified by a user-provided query. Starting from a large unlabeled image collection, the aim is to rapidly identify diverse instances of an object category, relying solely on the initial query and the user's Relevance Feedback, with no prior labels. The retrieval process is formulated as a binary classification task in which the system, through iterative user interaction, continuously learns to distinguish images that are relevant to the query from those that are not. This interaction is guided by an Active Learning loop: at each iteration, the system selects informative samples for user annotation, and the resulting labels refine retrieval performance. The task is particularly challenging in multi-object datasets, where the object of interest may occupy only a small region of a complex, cluttered scene. Unlike object-centered settings, where global descriptors often suffice, multi-object images require localized descriptors that are better adapted to the object. We formulate the task using pre-trained ViT representations and address key design questions: which object instances to consider in an image, what form the annotations should take, how Active Selection should be applied, and which representation strategies best capture the object's features. We compare several representation strategies across multi-object datasets, highlighting trade-offs between capturing the global context and focusing on fine-grained local object details. Our results offer practical insights for the design of effective interactive retrieval pipelines based on Active Learning for object class retrieval.
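
The relevance-feedback loop described above can be illustrated with a minimal sketch. It is not the pipeline evaluated in this work: the abstract does not specify the classifier, the selection criterion, or the descriptors, so everything named here is an assumption made for illustration. The sketch assumes one L2-normalized feature vector per image (standing in for a ViT descriptor), a small logistic regression as the binary relevance classifier, uncertainty sampling as the Active Selection rule, and a hypothetical `oracle` callable that plays the role of the annotating user.

```python
import numpy as np


def _sigmoid(z):
    # Clip the logits so np.exp never overflows.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def train_logreg(X, y, lr=1.0, steps=300, l2=1e-2):
    """Fit an L2-regularized logistic regression by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = _sigmoid(X @ w + b)
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
        b -= lr * np.mean(p - y)
    return w, b


def retrieval_loop(features, oracle, query_idx, rounds=5, batch=10, seed=0):
    """Illustrative human-in-the-loop retrieval (assumed design, see lead-in).

    features : (n, d) array, one assumed L2-normalized descriptor per image
    oracle   : hypothetical callable, index -> True if the image is relevant
    Returns per-image relevance scores and the dict of collected labels.
    """
    rng = np.random.default_rng(seed)
    labeled = {int(query_idx): 1}  # the query is the only initial positive
    # Before any classifier exists, rank by similarity to the query.
    scores = features @ features[query_idx]

    for _ in range(rounds):
        unlabeled = np.array([i for i in range(len(features)) if i not in labeled])
        if len(unlabeled) == 0:
            break
        k = min(batch, len(unlabeled))
        if len(set(labeled.values())) < 2:
            # Only one class seen so far: fall back to random selection.
            picked = rng.choice(unlabeled, size=k, replace=False)
        else:
            # Uncertainty sampling: scores closest to the 0.5 decision boundary.
            order = np.argsort(np.abs(scores[unlabeled] - 0.5))
            picked = unlabeled[order[:k]]

        # Simulated user annotation (relevance feedback).
        for i in picked:
            labeled[int(i)] = int(oracle(int(i)))

        # Retrain the relevance classifier once both classes are present.
        if len(set(labeled.values())) == 2:
            idx = np.array(list(labeled))
            y = np.array([labeled[i] for i in idx], dtype=float)
            w, b = train_logreg(features[idx], y)
            scores = _sigmoid(features @ w + b)

    return scores, labeled
```

Sorting the images by the returned `scores` in descending order gives the retrieval ranking after the final round. In a multi-object setting, the single vector per image would presumably be replaced by localized descriptors, which is the design question the abstract raises; the sketch does not attempt to model that.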
