Tracking and following objects of interest is critical to several robotics use cases, ranging from industrial automation to logistics and warehousing, to healthcare and security. In this paper, we present a robotic system to detect, track, and follow any object in real time. Our approach, dubbed "follow anything" (FAn), is an open-vocabulary and multimodal model -- it is not restricted to concepts seen at training time and can be applied to novel classes at inference time using text, images, or click queries. Leveraging rich visual descriptors from large-scale pre-trained models (foundation models), FAn can detect and segment objects by matching multimodal queries (text, images, clicks) against an input image sequence. These detected and segmented objects are tracked across image frames, all while accounting for occlusion and object re-emergence. We demonstrate FAn on a real-world robotic system (a micro aerial vehicle) and report its ability to seamlessly follow the objects of interest in a real-time control loop. FAn can be deployed on a laptop with a lightweight (6-8 GB) graphics card, achieving a throughput of 6-20 frames per second. To enable rapid adoption, deployment, and extensibility, we open-source all our code on our project webpage (https://github.com/alaamaalouf/FollowAnything). We also encourage the reader to watch our 5-minute explainer video (https://www.youtube.com/watch?v=6Mgt3EPytrw).
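To make the query-matching idea concrete, the following is a minimal sketch, not the FAn implementation. It assumes that a pre-trained model has already produced a per-pixel descriptor map and that the query (text, image, or click) has been embedded in the same feature space. The function names `match_query` and `click_query`, the cosine-similarity criterion, and the threshold value are all illustrative assumptions.

```python
import numpy as np


def match_query(descriptors: np.ndarray, query: np.ndarray,
                threshold: float = 0.5) -> np.ndarray:
    """Return a boolean (H, W) mask of pixels whose descriptor matches the query.

    descriptors: (H, W, D) per-pixel feature map (assumed to come from a
                 pre-trained model; how it is computed is outside this sketch).
    query:       (D,) query embedding in the same feature space (assumption).
    threshold:   hypothetical cosine-similarity cutoff.
    """
    # Normalize both sides so the dot product equals cosine similarity.
    d = descriptors / (np.linalg.norm(descriptors, axis=-1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    similarity = d @ q  # shape (H, W)
    return similarity > threshold


def click_query(descriptors: np.ndarray, row: int, col: int) -> np.ndarray:
    """Use the descriptor under a clicked pixel as the query (illustrative)."""
    return descriptors[row, col]


if __name__ == "__main__":
    # Toy 2x2 feature map with 3-dimensional descriptors.
    feats = np.zeros((2, 2, 3))
    feats[0, 0] = [1.0, 0.0, 0.0]
    feats[0, 1] = [0.0, 1.0, 0.0]
    feats[1, 0] = [2.0, 0.0, 0.0]
    feats[1, 1] = [-1.0, 0.0, 0.0]
    mask = match_query(feats, click_query(feats, 0, 0))
    print(mask)
```

In this toy example, a click on pixel (0, 0) selects every pixel whose descriptor points in the same direction, regardless of magnitude. A real system would then pass the resulting mask to a tracker.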