Large language models (LLMs) are increasingly deployed as agents for complex tasks such as question answering (QA), scientific debate, and software development. A standard evaluation procedure aggregates multiple agent responses into a single final answer, often via majority voting, and compares it against reference answers. However, this aggregation can obscure the quality and distributional characteristics of the individual responses. In this paper, we propose a novel evaluation framework based on the empirical cumulative distribution function (ECDF) of cosine similarities between generated responses and reference answers, enabling a more nuanced assessment of response quality than exact-match metrics allow. To analyze response distributions across agent configurations, we further introduce a clustering method that applies the $k$-medoids algorithm to pairwise distances between ECDFs. Our experiments on a QA dataset demonstrate that ECDFs can distinguish agent settings with similar final accuracies but different quality distributions. The clustering analysis also reveals interpretable group structure in the responses, offering insight into the effects of temperature, persona, and question topic.
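The pipeline described in the abstract — cosine similarities between response and reference embeddings, per-configuration ECDFs, pairwise distances between ECDFs, and $k$-medoids clustering over those distances — can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names are ours, embeddings are assumed precomputed, and the L1 distance between ECDFs and the PAM-style alternation are plausible stand-ins for whichever distance and $k$-medoids variant the paper actually uses.

```python
import numpy as np

def cosine_similarities(responses, reference):
    """Cosine similarity between each response embedding and the reference embedding.

    `responses` is an (n, d) array; `reference` is a (d,) array."""
    responses = np.asarray(responses, dtype=float)
    reference = np.asarray(reference, dtype=float)
    num = responses @ reference
    den = np.linalg.norm(responses, axis=1) * np.linalg.norm(reference)
    return num / den

def ecdf(values, grid):
    """Empirical CDF of `values`, evaluated at each point of `grid`."""
    values = np.sort(np.asarray(values, dtype=float))
    return np.searchsorted(values, grid, side="right") / len(values)

def ecdf_distance(f, g):
    """L1 distance between two ECDFs sampled on the same grid (one plausible choice)."""
    return np.abs(f - g).mean()

def k_medoids(dist, k, n_iter=100, seed=0):
    """Plain PAM-style k-medoids on a precomputed (n, n) distance matrix.

    Alternates between assigning points to the nearest medoid and moving each
    medoid to the member of its cluster with the smallest within-cluster cost."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels
```

In this sketch, each agent configuration yields one ECDF over its similarity scores; the distance matrix between all ECDFs then feeds the clustering, so configurations with similar-looking similarity distributions end up in the same group.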