Foundation models, e.g., large language models, possess attributes of intelligence that promise to endow a robot with the contextual understanding necessary to navigate complex, unstructured tasks in the wild. For the future of space robotics, we identify three core challenges that motivate adapting a foundation model to space-based applications: 1) scalability of ground-in-the-loop operations; 2) generalization of prior knowledge to novel environments; and 3) multi-modality in tasks and sensor data. Therefore, as a first step toward building a foundation model for space-based applications, we automatically label the AI4Mars dataset to curate a language-annotated dataset of visual question-answer tuples. We fine-tune a pretrained LLaVA checkpoint on this dataset to endow a vision-language model with the ability to perform spatial reasoning and navigation on the Martian surface. In this work, we demonstrate that 1) existing vision-language models are deficient visual reasoners in space-based applications, and 2) fine-tuning a vision-language model on extraterrestrial data significantly improves response quality, even with a limited training dataset of only a few thousand samples.
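The automatic labeling step described above can be sketched as follows. This is a minimal, hypothetical illustration of converting per-image terrain statistics (AI4Mars labels its images with terrain classes such as soil, bedrock, sand, and big rock) into visual question-answer tuples; the class IDs, thresholds, question template, and image identifier are assumptions for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch: turning AI4Mars terrain-class statistics into VQA tuples.
# Class ids, thresholds, and the question template are illustrative assumptions.
import json

TERRAIN_CLASSES = {0: "soil", 1: "bedrock", 2: "sand", 3: "big rock"}

def mask_to_vqa(image_id: str, class_fractions: dict) -> dict:
    """Build one visual question-answer tuple from per-class pixel fractions."""
    dominant = max(class_fractions, key=class_fractions.get)
    question = "What terrain types are visible, and is the area safe to traverse?"
    # List every terrain class covering more than 5% of the image (assumed cutoff).
    terrains = ", ".join(
        TERRAIN_CLASSES[c] for c, f in class_fractions.items() if f > 0.05
    )
    # Treat the scene as traversable if big rocks cover under 10% (assumed rule).
    safe = "yes" if class_fractions.get(3, 0.0) < 0.1 else "no"
    answer = (
        f"The image mainly shows {TERRAIN_CLASSES[dominant]}; "
        f"visible terrain: {terrains}. Safe to traverse: {safe}."
    )
    return {"image": image_id, "question": question, "answer": answer}

# Example: a scene that is 60% soil, 30% bedrock, 10% big rock.
sample = mask_to_vqa("NLB_123456", {0: 0.6, 1: 0.3, 3: 0.1})
print(json.dumps(sample, indent=2))
```

Language annotations of this shape pair naturally with LLaVA-style instruction tuning, where each training sample is an image plus a question-answer conversation.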