Spoken instructions in robot-to-human handovers may specify either an object ("the cup") or an intended use ("pour water"); in both cases, a successful handover requires the robot to infer the target object and the region that must remain available for the human to hold. If the robot grasps that hold region, the object can become awkward to receive and use immediately, potentially reducing perceived competence and trust; if the gripper comes too close to the receiving hand during delivery, perceived safety may also suffer. We present Intent-Handover, which grounds unconstrained speech and visual scene context into explicit grasp and delivery constraints. Given a spoken instruction and a scene observation, a vision-language model identifies the target object and the intended human-usage region. A grasp optimization module then selects a feasible grasp that keeps this region accessible while enforcing clearance from the predicted receiving hand. During execution, the robot tracks upper-body keypoints to estimate the user's receiving pose and positions the handover at an ergonomically feasible location. In a within-subjects ablation study (n=30), human-usage region awareness increases perceived trust, hand-gripper collision avoidance increases perceived safety, and interaction comfort is highest when both are enabled. Website and code: https://robot-future.github.io/intent-handover/.
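As an illustration of the two grasp constraints described above, the following is a minimal sketch and not the Intent-Handover implementation: the abstract does not specify the optimization, so a simple filter-then-rank selection is substituted here. The names `GraspCandidate` and `select_grasp`, the point-based region representation, and the margin values are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import dist
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres


@dataclass(frozen=True)
class GraspCandidate:
    """Hypothetical grasp candidate; not a type from the paper's code."""
    position: Point  # gripper contact point on the object
    quality: float   # assumed grasp-quality score, higher is better


def select_grasp(
    candidates: Sequence[GraspCandidate],
    usage_region: Sequence[Point],
    hand_position: Point,
    region_margin: float = 0.03,   # assumed minimum distance to the usage region
    hand_clearance: float = 0.10,  # assumed minimum distance to the receiving hand
) -> Optional[GraspCandidate]:
    """Return the best-scoring grasp that satisfies both constraints, or None."""
    feasible = []
    for grasp in candidates:
        # Constraint 1: leave the human-usage region free for the receiver.
        keeps_region_free = all(
            dist(grasp.position, p) >= region_margin for p in usage_region
        )
        # Constraint 2: keep the gripper away from the predicted receiving hand.
        clears_hand = dist(grasp.position, hand_position) >= hand_clearance
        if keeps_region_free and clears_hand:
            feasible.append(grasp)
    return max(feasible, key=lambda g: g.quality, default=None)


# Toy mug: the handle is the usage region, the hand is predicted near it.
handle_region = [(0.06, 0.0, 0.05), (0.06, 0.0, 0.03)]
hand = (0.15, 0.0, 0.05)
candidates = [
    GraspCandidate(position=(0.06, 0.0, 0.04), quality=0.9),   # on the handle
    GraspCandidate(position=(-0.04, 0.0, 0.04), quality=0.7),  # on the body
    GraspCandidate(position=(0.07, 0.0, 0.09), quality=0.8),   # rim, near the hand
]
best = select_grasp(candidates, handle_region, hand)
print(best)
```

In this toy scene the highest-scoring candidate sits on the handle and is rejected by the usage-region constraint, the rim candidate is rejected by the hand-clearance constraint, and the body grasp is selected. A real system would replace the point distances with geometry appropriate to the gripper and the segmented region.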