This position paper calls on the research community not only to investigate how human biases are inherited by large language models (LLMs) but also to explore how these biases can be leveraged to make society's "unwritten code", such as implicit stereotypes and heuristics, visible and open to critique. We introduce a conceptual framework through a case study in science: uncovering the hidden rules of peer review, that is, the factors reviewers care about but rarely state explicitly because of normative scientific expectations. The framework prompts LLMs to articulate their heuristics by generating self-consistent hypotheses about why one paper in a pair received stronger reviewer scores, using paired papers submitted to 46 academic conferences. It then iteratively searches for deeper hypotheses among the remaining pairs that existing hypotheses cannot explain. We observe that the LLMs' normative priors about the internal characteristics of good science, extracted from their self-talk (e.g., theoretical rigor), are systematically updated toward posteriors that emphasize storytelling about external connections, such as how a work is positioned and connected within and across literatures. Human reviewers tend to explicitly reward aspects that moderately align with the LLMs' normative priors (correlation = 0.49) but avoid articulating the contextualization and storytelling posteriors in their review comments (correlation = -0.14), even though they implicitly reward these aspects with positive scores. These patterns are robust across different models and out-of-sample judgments. We discuss the broad applicability of the proposed framework, which uses LLMs as diagnostic tools to amplify and surface the tacit codes underlying human society, enabling public discussion of the values it reveals and more precisely targeted responsible AI.
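The iterative search over paper pairs can be illustrated with a toy sketch. This is not the authors' implementation. Papers are represented as hypothetical numeric feature scores, and the function `propose_hypothesis` stands in for an LLM call by greedily picking the unused feature that explains the most still-unexplained pairs, which amounts to a greedy set-cover heuristic. All names and data here are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: a paper is a dict of feature scores.
# In the framework described above, judgments would come from an LLM instead.
Paper = Dict[str, float]
Pair = Tuple[Paper, Paper]  # (higher-scored paper, lower-scored paper)


def explains(feature: str, pair: Pair) -> bool:
    """A hypothesis (feature) explains a pair if the stronger paper scores higher on it."""
    stronger, weaker = pair
    return stronger.get(feature, 0.0) > weaker.get(feature, 0.0)


def propose_hypothesis(pairs: List[Pair], used: List[str]) -> Optional[str]:
    """Hypothetical stand-in for an LLM call: return the unused feature that
    explains the most remaining pairs, or None if none explains any pair."""
    candidates = {f for stronger, weaker in pairs for f in (*stronger, *weaker)} - set(used)
    best, best_count = None, 0
    for feature in sorted(candidates):  # sorted for deterministic tie-breaking
        count = sum(explains(feature, p) for p in pairs)
        if count > best_count:
            best, best_count = feature, count
    return best


def iterative_hypothesis_search(
    pairs: List[Pair], max_rounds: int = 10
) -> Tuple[List[str], List[Pair]]:
    """Accept hypotheses one at a time, each time searching only the pairs
    that the hypotheses accepted so far leave unexplained."""
    hypotheses: List[str] = []
    remaining = list(pairs)
    for _ in range(max_rounds):
        if not remaining:
            break
        h = propose_hypothesis(remaining, hypotheses)
        if h is None:
            break
        hypotheses.append(h)
        # Drop pairs the new hypothesis explains; the rest drive the next round.
        remaining = [p for p in remaining if not explains(h, p)]
    return hypotheses, remaining


# Toy data (invented for illustration only).
pairs: List[Pair] = [
    ({"rigor": 0.9, "positioning": 0.5}, {"rigor": 0.4, "positioning": 0.6}),
    ({"rigor": 0.8, "positioning": 0.7}, {"rigor": 0.3, "positioning": 0.2}),
    ({"rigor": 0.2, "positioning": 0.9}, {"rigor": 0.6, "positioning": 0.1}),
    ({"rigor": 0.5, "positioning": 0.8}, {"rigor": 0.5, "positioning": 0.3}),
]
hyps, left = iterative_hypothesis_search(pairs)
print(hyps, left)  # → ['positioning', 'rigor'] []
```

In this toy run, the first round accepts the feature that explains three of the four pairs. The second round searches only the single leftover pair, mirroring the idea of digging for deeper hypotheses where earlier ones fail.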