PerspAct: Enhancing LLM Situated Collaboration Skills through Perspective Taking and Active Vision

Sabrina Patania, Luca Annese, Anita Pellegrini, Silvia Serino, Anna Lambiase, Luca Pallonetto, Silvia Rossi, Simone Colombani, Tom Foulsham, Azzurra Ruggeri, Dimitri Ognibene

From arXiv. Accepted at IAS19.

Recent advances in Large Language Models (LLMs) and multimodal foundation models have significantly broadened their application in robotics and collaborative systems. However, effective multi-agent interaction necessitates robust perspective-taking capabilities, enabling models to interpret both physical and epistemic viewpoints. Current training paradigms often neglect these interactive contexts, resulting in challenges when models must reason about the subjectivity of individual perspectives or navigate environments with multiple observers. This study evaluates whether explicitly incorporating diverse points of view using the ReAct framework, an approach that integrates reasoning and acting, can enhance an LLM's ability to understand and ground the demands of other agents. We extend the classic Director task by introducing active visual exploration across a suite of seven scenarios of increasing perspective-taking complexity. These scenarios are designed to challenge the agent's capacity to resolve referential ambiguity based on visual access and interaction, under varying state representations and prompting strategies, including ReAct-style reasoning. Our results demonstrate that explicit perspective cues, combined with active exploration strategies, significantly improve the model's interpretative accuracy and collaborative effectiveness. These findings highlight the potential of integrating active perception with perspective-taking mechanisms in advancing LLMs' application in robotics and multi-agent systems, setting a foundation for future research into adaptive and context-aware AI systems.
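To make the idea of resolving referential ambiguity through visual access concrete, here is a minimal toy sketch of a Director-style trial. It is not the authors' implementation: the `Item` class, the `resolve_smallest` function, the `occluded_ids` set, and the `look`/`pick` trace strings are all hypothetical names chosen for illustration, and a hard-coded rule stands in for the LLM's reasoning step. The sketch assumes the director asks for "the small ball" while one ball is hidden from the director's side; the agent either answers from its own viewpoint or first checks (an "active vision" action) what the director can see.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Item:
    ident: str                            # unique label (hypothetical)
    kind: str                             # object category, e.g. "ball"
    size: int                             # smaller number = smaller object
    director_sees: Optional[bool] = None  # None = agent has not checked yet


def make_items() -> List[Item]:
    """Build a fresh toy shelf; the contents are an illustrative assumption."""
    return [
        Item("ball_s", "ball", 1),
        Item("ball_m", "ball", 2),
        Item("ball_l", "ball", 3),
        Item("cup_s", "cup", 1),
    ]


def resolve_smallest(
    kind: str,
    items: List[Item],
    occluded_ids: Set[str],
    use_perspective: bool = True,
) -> Tuple[Optional[str], List[str]]:
    """Pick the referent of 'the small <kind>' and return it with an action trace.

    With use_perspective=True the agent first looks at each candidate to learn
    whether the director can see it, then discards hidden candidates.
    With use_perspective=False it answers from its own (egocentric) view.
    """
    trace: List[str] = []
    candidates = [it for it in items if it.kind == kind]

    if use_perspective:
        for it in candidates:
            if it.director_sees is None:
                # Act: a stand-in for an active visual exploration step.
                it.director_sees = it.ident not in occluded_ids
                trace.append(f"look({it.ident}) -> director_sees={it.director_sees}")
        # Reason: the director can only refer to objects they can see.
        candidates = [it for it in candidates if it.director_sees]

    if not candidates:
        return None, trace

    target = min(candidates, key=lambda it: it.size)
    trace.append(f"pick({target.ident})")
    return target.ident, trace


if __name__ == "__main__":
    hidden = {"ball_s"}  # assumed: the smallest ball is hidden from the director
    print(resolve_smallest("ball", make_items(), hidden, use_perspective=False))
    print(resolve_smallest("ball", make_items(), hidden, use_perspective=True))
```

In this toy setup the egocentric agent selects the ball the director cannot see, while the perspective-taking agent spends three look actions and then selects the smallest ball visible to both parties. The interleaved look and pick steps loosely mirror the alternation of acting and reasoning that ReAct-style prompting encourages.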