Interactive perception (IP) enables robots to extract hidden information from their workspace and execute manipulation plans by physically interacting with objects and altering the state of the environment -- a capability crucial for resolving occlusions and ambiguity in complex, partially observable scenarios. We present Zero-Shot IP (ZS-IP), a novel framework that couples multi-strategy manipulation (pushing and grasping) with a memory-driven Vision Language Model (VLM) to guide robotic interactions and resolve semantic queries. ZS-IP integrates three key components: (1) an Enhanced Observation (EO) module that augments the VLM's visual perception with both conventional keypoints and our proposed pushlines -- a novel 2D visual augmentation tailored to pushing actions; (2) a memory-guided action module that reinforces semantic reasoning through context lookup; and (3) a robotic controller that executes pushing, pulling, or grasping based on the VLM's output. Unlike grid-based augmentations optimized for pick-and-place, pushlines capture affordances for contact-rich actions, substantially improving pushing performance. We evaluate ZS-IP on a 7-DOF Franka Panda arm across diverse scenes with varying occlusions and task complexities. Our experiments demonstrate that ZS-IP outperforms passive and viewpoint-based perception techniques such as Mark-Based Visual Prompting (MOKA), particularly in pushing tasks, while preserving the integrity of non-target elements.