Instruction generation is a vital, multidisciplinary research area with broad applications. Existing instruction generation models are limited to producing instructions in a single style learned from a particular dataset, and neither the style nor the content of the generated instructions can be controlled. Moreover, most existing instruction generation methods disregard spatial modeling of the navigation environment. Leveraging the capabilities of Large Language Models (LLMs), we propose C-Instructor, which uses chain-of-thought-style prompting for style-controllable and content-controllable instruction generation. First, we propose a Chain of Thought with Landmarks (CoTL) mechanism, which guides the LLM to identify key landmarks and then generate complete instructions. CoTL makes the generated instructions easier to follow and offers greater controllability through the manipulation of landmark objects. Second, we present a Spatial Topology Modeling Task that helps the model understand the spatial structure of the environment. Finally, we introduce a Style-Mixed Training policy, which harnesses the prior knowledge of LLMs to enable style control through different prompts within a single model instance. Extensive experiments demonstrate that instructions generated by C-Instructor outperform those generated by previous methods on text metrics, navigation guidance evaluation, and user studies.
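To make the landmark-first idea concrete, the following is a minimal sketch of two-stage, style-conditioned prompting in the spirit of CoTL. It is not the C-Instructor implementation: the abstract does not give the prompts, the model interface, or the visual input pipeline, so the names `STYLE_PROMPTS`, `landmark_prompt`, `instruction_prompt`, and `generate_instruction`, the prompt wording, and the plain-text trajectory description are all assumptions made for illustration. The `llm` argument is any callable that maps a prompt string to a completion string.

```python
from typing import Callable, List, Optional

# Hypothetical style prompts; the real prompts are not stated in the abstract.
STYLE_PROMPTS = {
    "fine": "Write a detailed step-by-step navigation instruction.",
    "coarse": "Write a brief, goal-oriented navigation instruction.",
}


def landmark_prompt(path_description: str) -> str:
    """Stage 1 prompt: ask the model to name key landmarks first."""
    return (
        "Trajectory: " + path_description + "\n"
        "List the key landmarks along this trajectory, separated by commas."
    )


def instruction_prompt(path_description: str, landmarks: List[str], style: str) -> str:
    """Stage 2 prompt: condition the full instruction on style and landmarks."""
    return (
        STYLE_PROMPTS[style] + "\n"
        "Trajectory: " + path_description + "\n"
        "Landmarks: " + ", ".join(landmarks) + "\n"
        "Instruction:"
    )


def generate_instruction(
    path_description: str,
    style: str,
    llm: Callable[[str], str],
    landmarks: Optional[List[str]] = None,
) -> str:
    """Generate an instruction, identifying landmarks first unless the caller
    supplies them (supplying them is the content-control path in this sketch)."""
    if landmarks is None:
        raw = llm(landmark_prompt(path_description))
        landmarks = [item.strip() for item in raw.split(",") if item.strip()]
    return llm(instruction_prompt(path_description, landmarks, style))
```

In this sketch, passing a different `style` key changes only the prompt sent to the same model instance, and passing an explicit `landmarks` list overrides the landmarks the model would otherwise pick, which loosely mirrors the style and content control described above.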