Using LLMs to evaluate text, known as LLM-as-a-Judge, is increasingly common at scale as a way to augment or even replace human annotation. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level, concept-based global policies from an LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations, and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization, and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmark datasets for content-harm detection. We find that the extracted global policies are highly faithful to the decisions of the LLM-as-a-Judge. Additionally, we evaluate the robustness of the global policies to text perturbations and adversarial attacks. Finally, we conduct a user study to assess user understanding of and satisfaction with the global policies.