Embodied agents designed to assist users with tasks must engage in natural language interactions, interpret instructions, execute actions, and communicate effectively to resolve issues. However, collecting large-scale, diverse datasets of situated human-robot dialogues to train and evaluate such agents is expensive, labor-intensive, and time-consuming. To address this challenge, we propose building a large language model (LLM)-based user agent that can simulate user behavior during interactions with an embodied agent in a virtual environment. Given a user goal (e.g., make breakfast), at each time step the user agent may "observe" the robot's actions or "speak" to either intervene with the robot or answer its questions. Such a user agent improves the scalability and efficiency of embodied dialogue dataset generation and is critical for enhancing and evaluating the robot's interaction and task-completion abilities, as well as for research on reinforcement learning from AI feedback. We evaluate our user agent's ability to generate human-like behaviors by comparing its simulated dialogues with the TEACh dataset. We perform three experiments: zero-shot prompting to predict dialogue acts, few-shot prompting, and fine-tuning on the TEACh training subset. Results show that the LLM-based user agent achieves an F-measure of 42% with zero-shot prompting and 43.4% with few-shot prompting in mimicking human speaking behavior. With fine-tuning, performance on deciding when to speak remained stable, while performance on deciding what to say improved from 51.1% to 62.5%. These findings demonstrate the feasibility of the proposed approach for assessing and enhancing the effectiveness of robot task completion through natural language communication.
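The per-step observe-or-speak decision described above can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: `query_llm` is a hypothetical placeholder for any chat-completion API, stubbed here with a trivial rule (speak only when the robot asks a question) so the example runs offline.

```python
# Minimal sketch of an LLM-based user agent that, at each time step,
# decides to "OBSERVE" the robot or "SPEAK" (to intervene or answer).

def query_llm(prompt: str) -> str:
    # Placeholder for a real LLM call (zero-shot dialogue-act prediction).
    # Stub rule: answer when the robot's latest event is a question.
    if "?" in prompt.splitlines()[-1]:
        return "SPEAK: The milk is in the fridge."
    return "OBSERVE"

def user_agent_step(goal: str, history: list[str], robot_event: str) -> str:
    # Build a zero-shot prompt from the goal and the dialogue/action history.
    prompt = (
        f"You are a user whose goal is: {goal}.\n"
        "At each step, reply 'OBSERVE' or 'SPEAK: <utterance>'.\n"
        + "\n".join(history + [robot_event])
    )
    return query_llm(prompt)

history: list[str] = []
for event in ["Robot: picks up a mug.", "Robot: Where is the milk?"]:
    act = user_agent_step("make breakfast", history, event)
    history.extend([event, f"User: {act}"])
```

In a full system the stub would be replaced by a prompted or fine-tuned LLM, and the predicted dialogue acts would be compared against the human turns in TEACh episodes.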