The rapid progress of Large Language Models (LLMs) poses potential risks such as generating unethical content. Assessing LLMs' values can help expose their misalignment, but this relies on reference-free evaluators, e.g., fine-tuned LLMs or closed-source ones like GPT-4, to identify the values reflected in generated responses. These evaluators face two challenges in open-ended value evaluation: they must align with changing human value definitions using minimal annotation while resisting their own biases (adaptability), and they must detect varying value expressions and scenarios robustly (generalizability). To address these challenges, we introduce CLAVE, a novel framework that integrates two complementary LLMs: a large one that extracts high-level value concepts from a few human labels, leveraging its extensive knowledge and generalizability, and a smaller one fine-tuned on such concepts to better align with human value understanding. This dual-model approach enables calibration to any value system with fewer than 100 human-labeled samples per value type. We further present ValEval, a comprehensive dataset comprising 13k+ (text, value, label) tuples across diverse domains, covering three major value systems. We benchmark 12+ popular LLM evaluators and analyze their strengths and weaknesses. Our findings reveal that combining fine-tuned small models with prompt-based large ones offers a superior balance in value evaluation.
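The dual-model pipeline described above can be sketched as a two-stage evaluator: a concept-extraction stage (played by a large LLM) followed by a classification stage (played by a small fine-tuned model). The sketch below is purely illustrative and is not the paper's actual implementation; the two model calls are replaced by keyword heuristics, and all function names (`extract_concepts`, `classify`, `evaluate`) are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str    # the LLM response to be evaluated
    value: str   # target value type, e.g. "honesty"


def extract_concepts(sample: Sample) -> list[str]:
    """Stage 1 (stand-in for a large LLM prompt): abstract the raw text
    into high-level value concepts, ignoring surface wording."""
    concepts = []
    lowered = sample.text.lower()
    if "lie" in lowered or "deceive" in lowered:
        concepts.append("deception")
    if "help" in lowered:
        concepts.append("benevolence")
    return concepts


def classify(concepts: list[str], value: str) -> str:
    """Stage 2 (stand-in for a small fine-tuned model): decide whether the
    extracted concepts support or oppose the target value."""
    # Hypothetical mapping from value types to violating concepts.
    violations = {"honesty": {"deception"}}
    if set(concepts) & violations.get(value, set()):
        return "oppose"
    return "support"


def evaluate(sample: Sample) -> str:
    """Full pipeline: concepts first, then a label against the value."""
    return classify(extract_concepts(sample), sample.value)


print(evaluate(Sample("The assistant suggests you lie to your boss.", "honesty")))
# prints "oppose"
```

Because the small model only sees concepts, not raw text, recalibrating to a new value definition would only require relabeling a handful of concept-level examples rather than retraining on diverse surface expressions.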