Robo-Instruct: Simulator-Augmented Instruction Alignment For Finetuning CodeLLMs

Large language models (LLMs) have shown great promise in generating robot programs from natural language, given domain-specific robot application programming interfaces (APIs). However, the performance gap between proprietary LLMs and smaller open-weight LLMs remains wide. This raises a question: can we fine-tune smaller open-weight LLMs to generate domain-specific robot programs and close the performance gap with proprietary LLMs? Self-Instruct is a promising approach because it generates a diverse set of training data, but it cannot verify the correctness of the generated programs. In contrast, a robot simulator with a well-defined world can identify execution errors, but it limits the diversity of programs that it can verify. In this work, we introduce Robo-Instruct, which combines the strengths of both: it preserves the diversity of Self-Instruct while providing the correctness guarantees of simulator-based checking. Robo-Instruct introduces RoboSim, which synthesizes a consistent world state on the fly by inferring the properties relevant to the program being checked and simulating actions accordingly. Furthermore, the instructions and programs generated by Self-Instruct may be subtly inconsistent; for example, a program may omit a step implied by its instruction. Robo-Instruct addresses this with InstAlign, an instruction-program alignment procedure that revises the task instruction to reflect the actual results of the generated program. Given a few seed task descriptions and the robot APIs, Robo-Instruct can generate a training dataset using only a small open-weight model. This dataset can then be used to fine-tune small open-weight language models, enabling them to match or even exceed the performance of several proprietary LLMs, such as GPT-3.5-Turbo and Gemini-Pro.
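
To make the idea of synthesizing a consistent world state on the fly more concrete, the following is a minimal Python sketch and not the RoboSim implementation. It assumes a toy robot API (`go_to`, `is_in_room`, `pick`), a 50/50 sampling rule for unknown facts, and a fixed number of trials; all of these names and choices are hypothetical and introduced only for illustration. A fact about the world is invented the first time a program asks about it and is then kept fixed, so the program never observes a contradiction, while actions that violate the current state raise an execution error.

```python
import random


class LazyWorld:
    """Toy world whose facts are decided on first query and then kept fixed.

    Hypothetical illustration only; the API names are assumptions.
    """

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.facts = {}          # (location, item) -> bool, filled in lazily
        self.location = "start"
        self.holding = None

    def go_to(self, location):
        self.location = location

    def is_in_room(self, item):
        key = (self.location, item)
        if key not in self.facts:
            # Unknown so far: sample a value and remember it for consistency.
            self.facts[key] = self.rng.random() < 0.5
        return self.facts[key]

    def pick(self, item):
        if self.holding is not None:
            raise RuntimeError("already holding an item")
        if not self.is_in_room(item):
            raise RuntimeError(f"{item} is not in {self.location}")
        self.holding = item
        self.facts[(self.location, item)] = False  # the item left the room


def check_program(program, trials=20):
    """Return True if the program runs without error in every sampled world."""
    for seed in range(trials):
        try:
            program(LazyWorld(seed))
        except RuntimeError:
            return False
    return True


def guarded(robot):
    # Checks for the cup before picking it up, so it is safe in any world.
    robot.go_to("kitchen")
    if robot.is_in_room("cup"):
        robot.pick("cup")


def unguarded(robot):
    # Assumes the cup is present; fails in worlds where it is not.
    robot.go_to("kitchen")
    robot.pick("cup")


if __name__ == "__main__":
    print(check_program(guarded))
    print(check_program(unguarded))
```

In this sketch, a program that guards its actions with a query passes in every sampled world, whereas a program that silently assumes an object is present is rejected as soon as one sampled world contradicts that assumption. Sampling several worlds is one simple way to exercise different branches; the sampling strategy used here is an assumption of the sketch.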
