Swiss DINO: Efficient and Versatile Vision Framework for On-device Personal Object Search

In this paper, we address a recent trend in robotic home appliances: equipping personal devices with vision systems that can personalize the appliance on the fly. In particular, we formulate and address the task of personal object search, which involves localizing and identifying personal items of interest in images captured by robotic appliances, with each item referenced by only a few annotated images. The task is crucial for robotic home appliances and mobile systems that need to process personal visual scenes or operate on particular personal objects (e.g., for grasping or navigation). In practice, personal object search presents two main technical challenges. First, a robot vision system must distinguish between many fine-grained classes in the presence of occlusions and clutter. Second, strict on-device resource constraints restrict the use of most state-of-the-art few-shot learning methods and often prevent on-device adaptation. In this work, we propose Swiss DINO: a simple yet effective framework for one-shot personal object search based on the recent DINOv2 transformer model, which has been shown to have strong zero-shot generalization properties. Swiss DINO handles the challenging requirements of on-device personalized scene understanding and does not require any adaptation training. We show significant improvement (up to 55%) in segmentation and recognition accuracy compared to common lightweight solutions, and significant reductions in backbone inference time (up to 100x) and GPU consumption (up to 10x) compared to heavy transformer-based solutions.
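
To make the training-free, one-shot idea concrete, the following is a minimal sketch of nearest-prototype matching over frozen features. It assumes, purely for illustration, that a frozen backbone such as DINOv2 has already produced one feature vector per reference image and per candidate region; random vectors stand in for those features here. The function names (`build_prototypes`, `identify`) are hypothetical, and this is a generic matching technique rather than the Swiss DINO pipeline itself.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def build_prototypes(reference_features):
    """Build one unit-norm prototype per personal object.

    reference_features: dict mapping an object name to an array of shape
    (n_refs, dim). In the one-shot case n_refs is 1. No training is involved;
    the prototype is just the mean of the normalized reference features.
    """
    names = sorted(reference_features)
    prototypes = np.stack([
        l2_normalize(np.asarray(reference_features[name], dtype=float)).mean(axis=0)
        for name in names
    ])
    return names, l2_normalize(prototypes)


def identify(query_feature, names, prototypes):
    """Return the best-matching object name and its cosine similarity."""
    query = l2_normalize(np.asarray(query_feature, dtype=float))
    similarities = prototypes @ query
    best = int(np.argmax(similarities))
    return names[best], float(similarities[best])


if __name__ == "__main__":
    # Assumption: random vectors stand in for frozen backbone features.
    rng = np.random.default_rng(0)
    references = {
        "mug": rng.normal(size=(1, 64)),
        "keys": rng.normal(size=(1, 64)),
        "toy": rng.normal(size=(1, 64)),
    }
    names, prototypes = build_prototypes(references)

    # A query that is a slightly perturbed view of the "mug" reference.
    query = references["mug"][0] + 0.1 * rng.normal(size=64)
    label, score = identify(query, names, prototypes)
    print(label, round(score, 3))
```

Because the only per-object state is a single prototype vector, registering a new personal item in this sketch costs one forward pass of the backbone and no gradient updates, which is the property that makes such schemes attractive under on-device constraints.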
