While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup", where the robot must act on one specific instance among visually similar objects. We study this setting of personal-object manipulation, in which a VLA must identify and control a user-specific object that was unseen during training, given only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple yet effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.
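To make the described pipeline concrete, the following is a minimal sketch of a VAP-style grounding-and-prompting step. It is an illustration under stated assumptions, not the paper's implementation: the `detect` and `embed` callables stand in for whatever open-vocabulary detector and image-embedding model are used, and the red-box highlight plus the instruction-rewrite template are placeholder choices.

```python
"""Minimal VAP-style sketch (illustration only, not the paper's exact method)."""
from typing import Callable, List, Sequence, Tuple

import numpy as np
from PIL import Image, ImageDraw

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


def visual_attentive_prompt(
    scene: Image.Image,
    references: Sequence[Image.Image],                 # few reference images of the user's object
    instruction: str,                                  # e.g. "bring my cup"
    category: str,                                     # generic category, e.g. "cup"
    detect: Callable[[Image.Image, str], List[Box]],   # open-vocabulary detector (assumed interface)
    embed: Callable[[Image.Image], np.ndarray],        # image embedder (assumed interface)
) -> Tuple[Image.Image, str]:
    """Ground the personal object and return (visually prompted image, rewritten instruction)."""
    # 1. Non-parametric visual memory: embeddings of the reference images.
    memory = [embed(ref) for ref in references]

    # 2. Open-vocabulary detection proposes candidate instances of the category.
    candidates = detect(scene, category)
    if not candidates:
        return scene, instruction  # fall back to the generic policy inputs

    # 3. Embedding-based matching: score each candidate crop against the memory.
    def score(box: Box) -> float:
        crop_emb = embed(scene.crop(box))
        return max(cosine(crop_emb, m) for m in memory)

    best_box = max(candidates, key=score)

    # 4. Visual prompt: highlight the matched instance and rewrite the instruction.
    prompted = scene.copy()
    ImageDraw.Draw(prompted).rectangle(best_box, outline=(255, 0, 0), width=4)
    rewritten = instruction.replace(f"my {category}", f"the {category} in the red box")

    return prompted, rewritten
```

The frozen VLA then receives the prompted image and the rewritten instruction in place of the raw observation and the personalized command, so no policy weights are updated.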