Large Language Models (LLMs) have opened transformative possibilities for human-robot collaboration. However, real-time collaboration demands both low latency and robust reasoning, and most LLMs incur latencies too high for real-time use. To address this gap, we first propose a fine-grained benchmark that explicitly assesses agents' proactive adaptability and temporal responsiveness in the Overcooked-AI environment. Based on the evaluation results, we propose MonTA (Monitor-then-Adapt), a hierarchical framework inspired by cognitive science research. MonTA contains three key modules: a lightweight Monitor that runs at high frequency (7 Hz) to detect adaptation needs, and two proficient Adapters for subtask and path adaptation reasoning that issue instructions to humans at a lower frequency. Our results demonstrate that MonTA significantly outperforms baseline agents on the proposed benchmark, achieving superior performance across layouts with varying levels of teaming fluency. User studies confirm that the adaptation plans our framework produces are reasonable and that the language instructions it delivers to humans are consistent.
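To make the Monitor-then-Adapt control flow concrete, the following is a minimal sketch of how a high-frequency Monitor could gate two slower, LLM-backed Adapters. All class names, method signatures, and routing heuristics here are illustrative assumptions, not the paper's implementation; only the 7 Hz Monitor rate and the Monitor/Adapter division of labor come from the abstract.

```python
import time

MONITOR_HZ = 7  # the lightweight Monitor runs at high frequency (7 Hz)


class Monitor:
    """Lightweight, high-frequency check for whether adaptation is needed."""

    def needs_adaptation(self, state: dict) -> bool:
        # Placeholder heuristic (hypothetical): flag adaptation when the
        # human's current behavior conflicts with the agent's plan.
        return state.get("plan_conflict", False)


class SubtaskAdapter:
    """Slower, LLM-backed reasoning over which subtask the human should take."""

    def adapt(self, state: dict) -> str:
        # In the real system this would query an LLM; stubbed for illustration.
        return "Please start plating the soup at the counter."


class PathAdapter:
    """Slower, LLM-backed reasoning over how the human should move."""

    def adapt(self, state: dict) -> str:
        return "Take the lower corridor to avoid blocking your partner."


def monta_loop(get_state, monitor, subtask_adapter, path_adapter):
    """Tick the Monitor at 7 Hz; invoke an Adapter only when a need is flagged."""
    period = 1.0 / MONITOR_HZ
    while True:
        state = get_state()
        if monitor.needs_adaptation(state):
            # Route to the appropriate Adapter (this routing rule is assumed).
            if state.get("needs_new_subtask", False):
                instruction = subtask_adapter.adapt(state)
            else:
                instruction = path_adapter.adapt(state)
            print(instruction)  # deliver the language instruction to the human
        time.sleep(period)
```

The design point this sketch illustrates is the frequency split: the cheap `needs_adaptation` check runs every tick, while the expensive Adapter reasoning is triggered only when needed, keeping overall latency low.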