LLM-based agent judges are an emerging approach to evaluating conversational AI, yet a fundamental uncertainty remains: can we trust their assessments, and if so, how many are needed? Through 960 sessions with two model pairs across 15 tasks, we show that persona-based agent judges produce evaluations indistinguishable from human raters in a Turing-style validation. We then identify a score-coverage dissociation: quality scores improve logarithmically with panel size, while unique issue discoveries follow a sublinear power law. Both exhibit diminishing returns, but scores saturate roughly twice as fast as discoveries. We hypothesize this reflects a power-law distribution of the finding space: critical issues are discovered first by small panels, while corner cases require progressively larger panels, analogous to species accumulation curves in ecology. The mechanism traces to ensemble diversity: Big Five personality conditioning makes agents probe different quality dimensions, with expert judges acting as adversarial probes that push discovery into the tail of the finding distribution. A controlled ablation confirms that structured persona conditioning, not simple prompting, is required to produce these scaling properties.
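To make the score-coverage dissociation concrete, the following minimal sketch contrasts the two functional forms named above: logarithmic growth for scores and a sublinear power law for unique discoveries. The function names and every parameter value (`a`, `b`, `c`, `beta`) are illustrative assumptions, not fitted values from the study, so the output shows only the qualitative shape of the curves and not the reported "roughly twice as fast" ratio.

```python
import math


def score(n, a=3.0, b=0.5):
    """Logarithmic growth a + b * ln(n). a and b are illustrative, not fitted."""
    return a + b * math.log(n)


def discoveries(n, c=4.0, beta=0.6):
    """Sublinear power law c * n**beta with 0 < beta < 1. c and beta are illustrative."""
    return c * n ** beta


def marginal_gain(f, n):
    """Gain from adding one more judge to a panel of size n."""
    return f(n + 1) - f(n)


def retained_gain(f, n):
    """Marginal gain at panel size n as a fraction of the gain at panel size 1."""
    return marginal_gain(f, n) / marginal_gain(f, 1)


if __name__ == "__main__":
    # Columns: panel size, retained gain for scores, retained gain for discoveries.
    for n in (1, 2, 4, 8, 16):
        print(n, round(retained_gain(score, n), 3), round(retained_gain(discoveries, n), 3))
```

Under these assumed parameters, both curves show diminishing returns, but the marginal gain of the logarithmic curve decays roughly like 1/n while that of the power law decays like n^(beta - 1), so additional judges keep paying off for coverage after scores have largely flattened.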