Language Models Should be Used to Surface the Unwritten Code of Science and Society

This paper calls on the research community not only to investigate how large language models (LLMs) inherit human biases but also to explore how those biases can be leveraged to make society's "unwritten code", such as implicit stereotypes and heuristics, visible and accessible for critique. We introduce a conceptual framework through a case study in science: uncovering the hidden rules of peer review, that is, the factors that reviewers care about but rarely state explicitly because of normative scientific expectations. The framework prompts LLMs to articulate their heuristics by generating self-consistent hypotheses about why one paper in a pair received stronger reviewer scores, using paired papers submitted to 46 academic conferences, and then iteratively searches for deeper hypotheses among the remaining pairs that the existing hypotheses cannot explain. We observed that LLMs' normative priors about the internal characteristics of good science, extracted from their self-talk (e.g., theoretical rigor), were systematically updated toward posteriors that emphasize storytelling about external connections, such as how the work is positioned and connected within and across literatures. Human reviewers tend to explicitly reward aspects that moderately align with LLMs' normative priors (correlation = 0.49) but avoid articulating the contextualization and storytelling posteriors in their review comments (correlation = -0.14), even though they implicitly reward them with positive scores. These patterns are robust across different models and out-of-sample judgments. We discuss the broad applicability of the proposed framework, which uses LLMs as diagnostic tools to amplify and surface the tacit codes underlying human society, enabling public discussion of revealed values and more precisely targeted responsible AI.
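
The iterative search described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the `max_rounds` parameter, and the toy representation of a paper as a set of trait labels are all assumptions made for illustration, and the hypothetical `toy_propose` function stands in for the step where an LLM would generate a hypothesis.

```python
from collections import Counter
from typing import Callable, List, Optional, Set, Tuple

# Assumption: a pair is (traits of the higher-scored paper, traits of the lower-scored paper).
Pair = Tuple[Set[str], Set[str]]


def iterative_hypothesis_search(
    pairs: List[Pair],
    propose: Callable[[List[Pair]], Optional[str]],
    explains: Callable[[Optional[str], Pair], bool],
    max_rounds: int = 5,
) -> Tuple[List[str], List[Pair]]:
    """Collect hypotheses round by round, each proposed only from still-unexplained pairs."""
    hypotheses: List[str] = []
    remaining = list(pairs)
    for _ in range(max_rounds):
        if not remaining:
            break
        hypothesis = propose(remaining)
        unexplained = [p for p in remaining if not explains(hypothesis, p)]
        if hypothesis is None or len(unexplained) == len(remaining):
            break  # the new hypothesis explains nothing further, so stop
        hypotheses.append(hypothesis)
        remaining = unexplained
    return hypotheses, remaining


def toy_propose(pairs: List[Pair]) -> Optional[str]:
    """Hypothetical stand-in for an LLM: pick the trait that most often separates winner from loser."""
    counts = Counter(trait for winner, loser in pairs for trait in winner - loser)
    return counts.most_common(1)[0][0] if counts else None


def toy_explains(hypothesis: Optional[str], pair: Pair) -> bool:
    """A hypothesis explains a pair if the winner has the trait and the loser lacks it."""
    winner, loser = pair
    return hypothesis in winner and hypothesis not in loser


# Toy data, invented purely for illustration.
demo_pairs: List[Pair] = [
    ({"rigor"}, set()),
    ({"rigor", "positioning"}, {"rigor"}),
    ({"positioning"}, set()),
]
demo_hypotheses, demo_unexplained = iterative_hypothesis_search(
    demo_pairs, toy_propose, toy_explains
)
print(demo_hypotheses, demo_unexplained)
```

In this toy run the first round proposes the trait that separates the most pairs, and the second round looks only at the pair left unexplained, which mirrors the idea of searching for deeper hypotheses in the residual pairs.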
