Prior robot painting and drawing work, such as FRIDA, has focused on decreasing the sim-to-real gap and expanding input modalities for users, but interaction with these systems is generally limited to the input stage. To support interactive, human-robot collaborative painting, we introduce the Collaborative FRIDA (CoFRIDA) robot painting framework, which can co-paint by modifying and engaging with content already painted by a human collaborator. To improve text-image alignment, FRIDA's major weakness, our system uses pre-trained text-to-image models; however, pre-trained models do not perform well in real-world co-painting because they (1) do not understand the constraints and abilities of the robot and (2) cannot perform co-painting without making unrealistic edits to the canvas and overwriting content. We propose a self-supervised fine-tuning procedure that tackles both issues, allowing pre-trained state-of-the-art text-image alignment models to be used with robots for co-painting in the physical world. Our open-source approach, CoFRIDA, creates paintings and drawings that match the input text prompt more clearly than FRIDA, both from a blank canvas and from one with human-created work. More generally, our fine-tuning procedure successfully encodes the robot's constraints and abilities into a foundation model, showcasing promising results as an effective method for reducing sim-to-real gaps.