While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.