Current evaluation frameworks and benchmarks for LLM-powered agents focus on text-chat-driven agents and do not expose the user's persona to the agent, thus operating in a user-agnostic environment. In the customer experience management domain, however, the agent's behaviour evolves as the agent learns about the user's personality. With the proliferation of real-time TTS and multi-modal language models, LLM-based agents are gradually becoming multi-modal. Towards this, we propose the MM-tau-p$^2$ benchmark, with metrics for evaluating the robustness of multi-modal agents in a dual-control setting, with and without user-persona adaptation, while also incorporating user inputs into the planning process to resolve a user query. In particular, our work shows that even with state-of-the-art frontier LLMs such as GPT-5 and GPT-4.1, introducing multi-modality into LLM-based agents raises additional considerations, which we measure using metrics such as multi-modal robustness and turn overhead. Overall, MM-tau-p$^2$ builds on our prior work FOCAL and provides a holistic way of evaluating multi-modal agents in an automated manner by introducing 12 novel metrics. We also provide estimates of these metrics on the telecom and retail domains using the LLM-as-judge approach, with carefully crafted prompts and well-defined rubrics for evaluating each conversation.