Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system that detects, tracks, and follows any object in real time. Our approach, dubbed ``follow anything'' (FAn), is an open-vocabulary, multimodal model: it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, image, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn detects and segments objects by matching multimodal queries (text, images, clicks) against an input image sequence. The detected and segmented objects are then tracked across image frames, while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source all our code on our project webpage at https://github.com/alaamaalouf/FollowAnything . We also encourage the reader to watch our 5-minute explainer video at https://www.youtube.com/watch?v=6Mgt3EPytrw .
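To make the query-matching idea concrete, the following is a minimal sketch and not the FAn implementation. It assumes each image cell already carries a descriptor vector (in practice these would come from a pre-trained model; here they are hand-written toy values) and that the query has been embedded in the same space. The function names, the cosine-similarity criterion, and the 0.8 threshold are illustrative assumptions, not taken from the released code.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_query(
    features: List[List[Vector]], query: Vector, threshold: float = 0.8
) -> List[List[bool]]:
    """Hypothetical helper: mark every cell whose descriptor resembles the query."""
    return [
        [cosine_similarity(cell, query) >= threshold for cell in row]
        for row in features
    ]


def mask_centroid(mask: List[List[bool]]) -> Optional[Tuple[float, float]]:
    """Return the (row, col) centroid of the matched cells, or None if nothing matched."""
    cells = [(r, c) for r, row in enumerate(mask) for c, hit in enumerate(row) if hit]
    if not cells:
        return None  # e.g. the object is occluded or out of view in this frame
    n = len(cells)
    return (sum(r for r, _ in cells) / n, sum(c for _, c in cells) / n)


# Toy 2x3 "feature map" with 2-D descriptors (assumed values, for illustration only).
features = [
    [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.1), (0.1, 1.0)],
]
query = (1.0, 0.0)

mask = match_query(features, query)
target = mask_centroid(mask)
```

In a following loop, a centroid such as `target` could serve as the point the controller tries to keep centered in the image, and a `None` result could signal that the object must be re-detected before following resumes.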