Robotic manipulation is often specified through language instructions or task identifiers, yet cluttered environments with similar objects are better handled by indicating spatially what to move and where to place it. To address this vision-centric challenge of object and goal specification, we present, to the best of our knowledge, the first formalization of Spatially Prompted Visual Trajectory Prediction (SP-VTP). In this setting, initial spatial prompts (such as bounding boxes or points) define the task objective, and the model must forecast future end-effector trajectories from egocentric video streams. To study this problem, we collect and annotate EgoSPT, a dataset of egocentric spatially prompted manipulation trajectories with first-frame object and target grounding annotations and recovered 3D end-effector motion. SP-VTP is challenging because the task specification is static while the scene configuration evolves over time. To solve it, we propose SPOT (Spatially Prompted Object-Target Policy), which combines a task encoder for first-frame visual and coordinate spatial prompts, an observation encoder for the current visual input and history context, and a trajectory generator for future end-effector motion. Experiments under strict scene-level splits show that SPOT improves cross-scene trajectory prediction over non-prompted and single-source prompted baselines. Together, EgoSPT and SPOT establish SP-VTP as a new spatial prompting problem and position spatial prompts as a simple and scalable form of task conditioning for egocentric manipulation.
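To make the SP-VTP input/output contract concrete, the following is a minimal interface sketch of the three-stage structure described in the abstract (task encoder, observation encoder, trajectory generator). It is not the SPOT implementation: all names (`SpatialPrompt`, `encode_task`, `encode_observation`, `generate_trajectory`, `predict`) are hypothetical, visual features are omitted entirely, and the generator is a constant-velocity placeholder that ignores the task features. It only illustrates that the prompt is fixed at the first frame while the observation changes at every step.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
Point3D = Tuple[float, float, float]     # end-effector position


@dataclass(frozen=True)
class SpatialPrompt:
    """First-frame task specification (hypothetical container)."""
    object_box: Box              # what to move
    target_box: Box              # where to place it
    frame_size: Tuple[int, int]  # (width, height) of the first frame


def encode_task(prompt: SpatialPrompt) -> List[float]:
    """Placeholder task encoder: normalized box coordinates only.

    Assumption: a real encoder would also use first-frame visual features.
    """
    width, height = prompt.frame_size
    features: List[float] = []
    for x0, y0, x1, y1 in (prompt.object_box, prompt.target_box):
        features.extend([x0 / width, y0 / height, x1 / width, y1 / height])
    return features


def encode_observation(history: Sequence[Point3D]) -> List[float]:
    """Placeholder observation encoder: last position plus last-step velocity."""
    last = history[-1]
    prev = history[-2] if len(history) > 1 else last
    velocity = [a - b for a, b in zip(last, prev)]
    return list(last) + velocity


def generate_trajectory(
    task_features: Sequence[float],
    obs_features: Sequence[float],
    horizon: int,
) -> List[Point3D]:
    """Placeholder generator: constant-velocity extrapolation.

    task_features is accepted to show the interface but is NOT used here;
    a learned generator would condition on it.
    """
    position, velocity = obs_features[:3], obs_features[3:]
    return [
        tuple(p + v * k for p, v in zip(position, velocity))
        for k in range(1, horizon + 1)
    ]


def predict(
    prompt: SpatialPrompt, history: Sequence[Point3D], horizon: int
) -> List[Point3D]:
    """Static prompt + evolving observation -> future end-effector positions."""
    return generate_trajectory(
        encode_task(prompt), encode_observation(history), horizon
    )
```

In use, `encode_task` would be computed once from the first frame, while `encode_observation` would be recomputed as new frames and end-effector history arrive, which mirrors the static-specification, dynamic-scene tension noted in the abstract.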