Large Language Models (LLMs) are increasingly embedded in evaluative processes, from information filtering to assessing and addressing knowledge gaps through explanation and credibility judgments. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings--NewsGuard and Media Bias/Fact Check--and against human judgments collected through a controlled experiment. We use news domains purely as a controlled benchmark for evaluative tasks, focusing on the underlying mechanisms rather than on news classification per se. To enable direct comparison, we implement a structured agentic framework in which both models and nonexpert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite alignment in final outputs, our findings show consistent differences in the observable criteria guiding model evaluations, suggesting that lexical associations and statistical priors could influence evaluations in ways that differ from contextual reasoning. This reliance is associated with systematic effects: political asymmetries and a tendency to conflate linguistic form with epistemic reliability--a dynamic we term epistemia, the illusion of knowledge that emerges when surface plausibility replaces verification. Delegating judgment to such systems may therefore reshape the heuristics underlying evaluation, suggesting a shift from normative reasoning toward pattern-based approximation and raising open questions about the role of LLMs in evaluative processes.
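The shared evaluation procedure described above (criteria selection, content retrieval, justification and rating, with outputs scored against expert labels) could be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the function names, criteria list, and rating labels are assumptions, and the three pluggable callables stand in for either an LLM agent or a human participant following the same protocol.

```python
from dataclasses import dataclass

# Illustrative criteria; the study's actual criterion set may differ.
CRITERIA = ["sourcing", "transparency", "accuracy", "tone"]

@dataclass
class Evaluation:
    domain: str
    criteria: list          # criteria the evaluator chose to apply
    justification: str      # free-text rationale produced at the final step
    rating: str             # e.g. "reliable" / "unreliable" (hypothetical labels)

def evaluate_domain(domain, select_criteria, retrieve, judge):
    """Run the three-step procedure shared by models and participants:
    (1) select criteria, (2) retrieve content, (3) justify and rate."""
    chosen = select_criteria(domain, CRITERIA)
    evidence = retrieve(domain, chosen)
    justification, rating = judge(domain, chosen, evidence)
    return Evaluation(domain, chosen, justification, rating)

def agreement(evals, expert_labels):
    """Fraction of domains where the evaluator's rating matches the
    expert label (e.g. a NewsGuard-style reliability judgment)."""
    matched = sum(1 for e in evals if expert_labels.get(e.domain) == e.rating)
    return matched / len(evals)
```

Keeping the three steps as injected callables is what makes the comparison direct: the same `evaluate_domain` loop runs whether the `judge` is a model call or a human response form, so differences show up in the chosen criteria and justifications rather than in the procedure itself.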