AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots

This paper presents AlignBot, a framework that optimizes VLM-powered customized task planning for household robots by aligning plans with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model that functions as an adapter for GPT-4o. The adapter internalizes diverse forms of user reminders (such as personalized preferences, corrective guidance, and contextual assistance) into structured, instruction-formatted cues that prompt GPT-4o to generate customized task plans. AlignBot also integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further improving task planning accuracy. To validate AlignBot, experiments are conducted in real-world household environments constructed within a laboratory to replicate typical household settings. A multimodal dataset of over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results show that AlignBot significantly improves customized task planning and outperforms existing LLM- and VLM-powered planners by interpreting and aligning with user reminders: it achieves an 86.8% success rate, compared with 21.6% for the vanilla GPT-4o baseline, an improvement of about 65 percentage points and just over four times the baseline success rate. Supplementary materials are available at: https://yding25.com/AlignBot/
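The retrieval-and-prompting flow described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not specify how task relevance is measured, how historical successes are stored, or how the prompt is laid out. Here a simple token-overlap (Jaccard) score stands in for the retrieval metric, and the names `SuccessRecord`, `retrieve_successes`, and `build_prompt` are hypothetical. No call to LLaVA-7B or GPT-4o is made; the cues are passed in as plain strings, as if the adapter had already produced them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SuccessRecord:
    """A past task and the plan that succeeded (hypothetical schema)."""
    task: str
    plan: str


def tokenize(text: str) -> set[str]:
    """Lowercased set of words; a crude stand-in for a learned embedding."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity in [0, 1]; assumed metric, not the paper's."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_successes(
    task: str, history: list[SuccessRecord], k: int = 2
) -> list[SuccessRecord]:
    """Return the k historical successes most similar to the current task."""
    query = tokenize(task)
    ranked = sorted(
        history,
        key=lambda record: jaccard(query, tokenize(record.task)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(task: str, cues: list[str], examples: list[SuccessRecord]) -> str:
    """Assemble a planner prompt from the task, adapter cues, and retrieved examples."""
    lines = [f"Task: {task}", "User cues:"]
    lines += [f"- {cue}" for cue in cues]
    lines.append("Relevant past successes:")
    lines += [f"- {ex.task} -> {ex.plan}" for ex in examples]
    return "\n".join(lines)
```

In a full system, the string returned by `build_prompt` would be sent to the planner model, and an embedding-based similarity would likely replace the token-overlap score used here.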
