Large Language Models (LLMs) are increasingly embedded in evaluative processes, from information filtering to assessing and addressing knowledge gaps through explanation and credibility judgments. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings--NewsGuard and Media Bias/Fact Check--and against human judgments collected through a controlled experiment. We use news domains purely as a controlled benchmark for evaluative tasks, focusing on the underlying mechanisms rather than on news classification per se. To enable direct comparison, we implement a structured agentic framework in which both models and nonexpert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite alignment in final outputs, our findings show consistent differences in the observable criteria guiding model evaluations, suggesting that lexical associations and statistical priors could influence evaluations in ways that differ from contextual reasoning. This reliance is associated with systematic effects: political asymmetries and a tendency to conflate linguistic form with epistemic reliability--a dynamic we term epistemia, the illusion of knowledge that emerges when surface plausibility replaces verification. Delegating judgment to such systems may therefore reshape the heuristics underlying evaluation, suggesting a shift from normative reasoning toward pattern-based approximation and raising open questions about the role of LLMs in evaluative processes.
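The shared evaluation procedure described above (criteria selection, content retrieval, justification and rating, with outputs scored against expert labels) could be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the function names, criteria list, and rating labels are assumptions, and the three pluggable callables stand in for either an LLM agent or a human participant following the same protocol.

```python
from dataclasses import dataclass

# Illustrative criteria; the study's actual criterion set may differ.
CRITERIA = ["sourcing", "transparency", "accuracy", "tone"]

@dataclass
class Evaluation:
    domain: str
    criteria: list          # criteria the evaluator chose to apply
    justification: str      # free-text rationale produced at the final step
    rating: str             # e.g. "reliable" / "unreliable" (hypothetical labels)

def evaluate_domain(domain, select_criteria, retrieve, judge):
    """Run the three-step procedure shared by models and participants:
    (1) select criteria, (2) retrieve content, (3) justify and rate."""
    chosen = select_criteria(domain, CRITERIA)
    evidence = retrieve(domain, chosen)
    justification, rating = judge(domain, chosen, evidence)
    return Evaluation(domain, chosen, justification, rating)

def agreement(evals, expert_labels):
    """Fraction of domains where the evaluator's rating matches the
    expert label (e.g. a NewsGuard-style reliability judgment)."""
    matched = sum(1 for e in evals if expert_labels.get(e.domain) == e.rating)
    return matched / len(evals)
```

Keeping the three steps as injected callables is what makes the comparison direct: the same `evaluate_domain` loop runs whether the `judge` is a model call or a human response form, so differences show up in the chosen criteria and justifications rather than in the procedure itself.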