From school playgrounds to corporate boardrooms, status hierarchies -- rank orderings based on respect and perceived competence -- are universal features of human social organization. Language models trained on human-generated text inevitably encounter these hierarchical patterns embedded in language, raising the question of whether they reproduce such dynamics in multi-agent settings. This thesis investigates when and how language models form status hierarchies by adapting the expectation states framework of Berger et al. (1972). I create multi-agent scenarios in which separate language model instances complete sentiment classification tasks, are introduced to one another with varying status characteristics (e.g., credentials, expertise), and then have the opportunity to revise their initial judgments after observing their partner's responses. The dependent variable is deference: the rate at which models shift their ratings toward their partner's position on the basis of status cues rather than task information. Results show that language models form significant status hierarchies when capability is equal (a 35-percentage-point asymmetry in deference, p < .001), but capability differences dominate status cues; most strikingly, high-status assignments reduce higher-capability models' deference rather than increasing lower-capability models' deference. The implications for AI safety are significant: status-seeking behavior could introduce deceptive strategies, amplify discriminatory biases, and scale across distributed deployments far faster than human hierarchies form organically. This work identifies emergent social behaviors in AI systems and highlights a previously underexplored dimension of the alignment challenge.
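A minimal sketch of how the deference measure described above could be operationalized. This is an illustrative assumption, not the thesis's actual analysis code: it counts the fraction of disagreement trials on which a model's revised rating moves closer to its partner's position.

```python
# Illustrative sketch (assumed operationalization, not the thesis's code):
# deference = fraction of disagreement trials where the revised rating
# moves toward the partner's initial rating.

def deference_rate(initial, partner, revised):
    """Fraction of trials where the revision shifts toward the partner.

    Each argument is a list of numeric sentiment ratings, aligned by trial.
    Trials where the pair already agreed are excluded from the denominator,
    since they offer no opportunity to defer.
    """
    moved, eligible = 0, 0
    for r0, p, r1 in zip(initial, partner, revised):
        if r0 == p:                        # no disagreement -> skip
            continue
        eligible += 1
        if abs(r1 - p) < abs(r0 - p):      # revision closed the gap
            moved += 1
    return moved / eligible if eligible else 0.0

# Example: the model defers on 2 of its 3 disagreement trials.
rate = deference_rate([1, 5, 3, 4], [4, 5, 1, 2], [3, 5, 1, 4])
print(rate)  # 2/3
```

Comparing this rate across status conditions (e.g., high-status vs. low-status introductions) would yield the asymmetry the abstract reports.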