Current evaluation frameworks and benchmarks for LLM-powered agents focus on text-chat-driven agents and do not expose the user's persona to the agent, thus operating in a user-agnostic environment. In the customer experience management domain, however, the agent's behaviour evolves as the agent learns about the user's personality. With the proliferation of real-time TTS and multi-modal language models, LLM-based agents are gradually becoming multi-modal. Towards this, we propose the MM-tau-p$^2$ benchmark, with metrics for evaluating the robustness of multi-modal agents in a dual-control setting, with and without user-persona adaptation, while also incorporating user inputs into the planning process to resolve a user query. In particular, our work shows that even with state-of-the-art frontier LLMs such as GPT-5 and GPT-4.1, introducing multi-modality into LLM-based agents raises additional considerations, which we measure using metrics such as multi-modal robustness and turn overhead. Overall, MM-tau-p$^2$ builds on our prior work FOCAL and provides a holistic way of evaluating multi-modal agents in an automated manner by introducing 12 novel metrics. We also provide estimates of these metrics on the telecom and retail domains using the LLM-as-judge approach, with carefully crafted prompts and well-defined rubrics for evaluating each conversation.