While Large Language Model (LLM)-based agents have successfully mimicked human behaviors in various scenarios, complex, multi-character social interactions within extended contexts remain underexplored. The challenge is compounded by privacy concerns, which make it difficult to capture and utilize intricate real-life interactions. More importantly, the absence of quantitative evaluation methods hampers the pursuit of high-quality agent interactions, often leading to interactions that are limited in informativeness and expressiveness and characterized by superficial small talk without clear intentions. In this work, we leverage the rules of Tabletop Role-Playing Games (TRPG) to create an environment conducive to complex, context-rich interactions, emphasizing informativeness and expressiveness. This virtual setting alleviates privacy concerns and motivates agents to engage in meaningful, high-quality interactions as part of their in-game objectives. To assess these interactions, we introduce the Agent interaction Evaluation framework (AntEval), targeting the qualitative evaluation of interaction informativeness and expressiveness. Specifically, we propose two novel evaluation metrics: Information Exchanging Precision (IEP), which assesses interactions in scenarios focused on information exchange, and Interaction Expressiveness Gap (IEG), which assesses interactions in scenarios focused on intention expression. Our experimental results demonstrate the effectiveness of these metrics in evaluating interaction quality. Notably, the metrics reveal significant room for improvement in the social interactions of LLMs. We believe AntEval will guide further exploration of complex agent interactions, bringing them closer to emulating real human behavior and enhancing their integration and utility in real-world applications.