Recent advances in natural language processing and Large Language Models (LLMs) have enabled AI agents to simulate human-like interactions within virtual worlds. However, these interactions remain limited in complexity and flexibility, particularly in scenarios involving multiple characters and novel objects. Pre-defining all interactable objects in the agent's world model is challenging, and conveying implicit intentions to multiple characters through complex interactions remains difficult. To address these issues, we propose integrating virtual Game Masters (GMs) into the agent's world model, drawing inspiration from Tabletop Role-Playing Games (TRPGs). GMs play a crucial role in overseeing information, estimating players' intentions, providing environment descriptions, and offering feedback, thereby compensating for deficiencies in current world models. To facilitate future exploration of complex interactions, we introduce a benchmark named Tachikuma, comprising a Multiple character and novel Object based interaction Estimation (MOE) task and a supporting dataset. MOE challenges models to understand characters' intentions and accurately determine their actions within intricate contexts involving multi-character and novel object interactions. In addition, the dataset captures log data from real-time communications during gameplay, providing diverse, grounded, and complex interactions for further study. Finally, we present a simple prompting baseline and evaluate its performance, demonstrating its effectiveness in enhancing interaction understanding. We hope that our dataset and task will inspire further research on complex natural language interactions, fostering the development of more advanced AI agents.