This research introduces the Bi-VLA (Vision-Language-Action) model, a novel system designed for bimanual robotic dexterous manipulation that seamlessly integrates visual scene understanding, language comprehension that translates human instructions into executable code, and physical action generation. We evaluated the system's functionality through a series of household tasks, including the preparation of a desired salad upon human request. Bi-VLA demonstrates the ability to interpret complex human instructions, perceive and understand the visual context of ingredients, and execute precise bimanual actions to prepare the requested salad. Through a series of experiments, we assessed the system's accuracy, efficiency, and adaptability to different salad recipes and human preferences. Our results show a 100% success rate in generating correct executable code by the Language Module, a 96.06% success rate in detecting specific ingredients by the Vision Module, and an overall success rate of 83.4% in correctly executing user-requested tasks.
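The three-module pipeline the abstract describes — a Vision Module that detects ingredients, a Language Module that turns an instruction into executable steps, and a bimanual action layer that carries them out — can be sketched as below. This is a minimal illustrative sketch only: all class names, interfaces, and the toy instruction parser are assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of a Bi-VLA-style pipeline.
# All names and interfaces here are illustrative assumptions.

class VisionModule:
    """Detects requested ingredients in the scene (stubbed with a lookup)."""
    def __init__(self, scene):
        self.scene = scene  # ingredient -> (x, y) location

    def detect(self, ingredient):
        # Returns the ingredient's location, or None if not visible.
        return self.scene.get(ingredient)


class LanguageModule:
    """Translates a human instruction into a list of ingredients to handle."""
    def parse(self, instruction):
        # Toy parser for instructions like "make a salad with tomato and cucumber".
        _, _, tail = instruction.partition("with ")
        return [item.strip() for item in tail.replace(" and ", ",").split(",") if item.strip()]


class BimanualArms:
    """Records pick-and-place actions instead of driving real hardware."""
    def __init__(self):
        self.log = []

    def pick_and_place(self, ingredient, location):
        self.log.append((ingredient, location))


def run_task(instruction, scene):
    """Vision + Language + Action: parse the request, locate each ingredient, act."""
    vision, language, arms = VisionModule(scene), LanguageModule(), BimanualArms()
    for ingredient in language.parse(instruction):
        location = vision.detect(ingredient)
        if location is not None:  # skip ingredients the vision module cannot find
            arms.pick_and_place(ingredient, location)
    return arms.log


# Example: a salad request against a toy scene.
actions = run_task(
    "make a salad with tomato and cucumber",
    {"tomato": (0, 1), "cucumber": (2, 3)},
)
# actions == [("tomato", (0, 1)), ("cucumber", (2, 3))]
```

In this decomposition, per-module success rates (like the 100%, 96.06%, and 83.4% figures reported) can be measured independently, since a failure at any stage — parsing, detection, or execution — is observable at that module's boundary.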