We introduce Vocal Sandbox, a framework for enabling seamless human-robot collaboration in situated environments. Systems in our framework are characterized by their ability to adapt and continually learn at multiple levels of abstraction from diverse teaching modalities such as spoken dialogue, object keypoints, and kinesthetic demonstrations. To enable such adaptation, we design lightweight and interpretable learning algorithms that allow users to build an understanding of, and co-adapt to, a robot's capabilities in real time as they teach new behaviors. For example, after demonstrating a new low-level skill for "tracking around" an object, users are shown trajectory visualizations of the robot's intended motion when asked to track a new object. Similarly, users teach high-level planning behaviors through spoken dialogue, using pretrained language models to synthesize behaviors such as "packing an object away" as compositions of low-level skills: concepts that can be reused and built upon. We evaluate Vocal Sandbox in two settings: collaborative gift bag assembly and LEGO stop-motion animation. In the first setting, we run systematic ablations and user studies with 8 non-expert participants, highlighting the impact of multi-level teaching. Across 23 hours of total robot interaction time, users teach 17 new high-level behaviors with an average of 16 novel low-level skills, requiring 22.1% less active supervision compared to baselines and yielding more complex autonomous performance (+19.7%) with fewer failures (-67.1%). Qualitatively, users strongly prefer Vocal Sandbox systems due to their ease of use (+20.6%) and overall performance (+13.9%). Finally, we pair an experienced system user with a robot to film a stop-motion animation; over two hours of continuous collaboration, the user teaches progressively more complex motion skills to shoot a 52-second (232-frame) movie.
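To make the composition idea concrete, here is a minimal illustrative sketch (not the authors' implementation) of how a high-level behavior like "packing an object away" could be expressed as a composition of named low-level skills, in the form a language model might synthesize it. The skill names, the registry, and the `pack_away` behavior are hypothetical.

```python
# Hypothetical skill registry: low-level skills are registered by name so
# that synthesized high-level behaviors can refer to them symbolically.
skills = {}

def skill(fn):
    """Decorator that registers a low-level skill under its function name."""
    skills[fn.__name__] = fn
    return fn

@skill
def pickup(obj):
    # Placeholder for a learned or demonstrated grasping skill.
    return f"pickup({obj})"

@skill
def goto(location):
    # Placeholder for a point-to-point motion skill.
    return f"goto({location})"

@skill
def release(obj):
    # Placeholder for a release/place skill.
    return f"release({obj})"

def pack_away(obj, container="bin"):
    """High-level behavior: a reusable composition of low-level skills,
    of the kind a pretrained language model could emit from dialogue."""
    return [
        skills["pickup"](obj),
        skills["goto"](container),
        skills["release"](obj),
    ]

print(pack_away("scissors"))
# ['pickup(scissors)', 'goto(bin)', 'release(scissors)']
```

Because `pack_away` is itself just a named program over the registry, it can in turn be registered and reused inside still higher-level behaviors, which is the "reused and built upon" property the abstract describes.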