Evaluating the quality of open-domain chatbots has become increasingly reliant on LLMs acting as automatic judges. However, existing meta-evaluation benchmarks are static, outdated, and lacking in multilingual coverage, limiting their ability to fully capture subtle weaknesses in evaluation. We introduce MEDAL, an automated multi-agent framework for curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate multilingual user-chatbot dialogues conditioned on varied seed contexts. A strong LLM (GPT-4.1) then performs a multidimensional analysis of chatbot performance, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new multilingual meta-evaluation benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. Using MEDAL, we find that state-of-the-art judges fail to reliably detect nuanced issues such as lack of empathy, commonsense, or relevance.