Human-robot collaboration (HRC) combines the complementary strengths of humans and robots to improve task efficiency. However, many existing collaborative systems rely on hand-engineered pipelines, which limits their scalability and flexibility on new tasks. In this work, we show that models trained end-to-end with imitation learning, specifically vision-language-action (VLA) models, can support collaborative manipulation, and we characterize the key factors affecting their real-world performance. We evaluate two state-of-the-art models and identify a failure mode of action-chunking policies in implicit HRC: demonstration action leakage (i.e., action chunks crossing latent task transitions) can cause premature assistive behavior. We find that this issue worsens with longer execution horizons and occurs in real-world collaborative VLA systems, such as when a robot attempts to hand over a tool before the person is ready. We propose an inference-time steering method that mitigates these erroneous assistive actions while preserving policy performance. Finally, through a 16-participant user study on a long-horizon collaborative assembly task, we show that steering enables a longer execution horizon while mitigating premature assistance, leading to faster collaboration and fewer failures than a shorter-horizon policy.