Situated embodied conversation requires robots to interleave real-time dialogue with active perception: deciding what to look at, when to look, and what to say under tight latency constraints. We present a minimal system recipe that pairs a real-time multimodal language model with a small set of tool interfaces for attention and active perception. We study six home-style scenarios that require frequent attention shifts and span increasing perceptual scope. Across four system variants, we evaluate turn-level tool-decision correctness against human annotations and collect subjective ratings of interaction quality. Results indicate that pairing real-time multimodal large language models with tool use for active perception is a promising direction for practical situated embodied conversation.
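The recipe of exposing perception as a small tool set that the model invokes per turn can be sketched as a dispatch loop. The tool names (`look_at`, `scan_scene`) and the rule-based `decide` stub below are illustrative assumptions, not the paper's actual interface; in a real system the multimodal model itself would emit the tool call.

```python
# Minimal sketch of a turn-level tool-dispatch loop for active perception.
# Tool names and the rule-based decide() stub are illustrative assumptions;
# in practice the real-time multimodal LM would choose the tool call.

TOOLS = {
    "look_at": lambda target: f"camera oriented toward {target}",
    "scan_scene": lambda: "wide sweep of the scene captured",
}

def decide(utterance: str) -> tuple[str, tuple]:
    """Stub policy standing in for the model: map a user turn to a tool call."""
    if "where" in utterance.lower():
        # Questions about location widen the perceptual scope.
        return "scan_scene", ()
    # Default: keep attention on the speaker.
    return "look_at", ("speaker",)

def handle_turn(utterance: str) -> str:
    """One conversational turn: pick a tool, execute it, return its result."""
    name, args = decide(utterance)
    return TOOLS[name](*args)
```

The point of the sketch is the shape of the interface: each turn yields at most one cheap, interpretable tool decision, which is what the paper's turn-level correctness evaluation scores against human annotations.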