Machine learning models -- including large language models (LLMs) -- are often said to exhibit monoculture: their outputs agree strikingly often. But what does it actually mean for models to agree too much? We argue that this question is inherently subjective, hinging on two key decisions. First, the analyst must specify a baseline null model for what "independence" should look like; as we show, different null models yield dramatically different inferences about excess agreement. Second, we show that inferences depend on the population of models and items under consideration: models that appear highly correlated in one context may appear independent when evaluated on a different set of questions, or against a different set of peers. Experiments on two large-scale benchmarks validate our theoretical findings. For example, a null model that accounts for item difficulty yields drastically different inferences than the null models of prior work, which do not. Together, our results reframe monoculture evaluation not as an absolute property of model behavior, but as a context-dependent inference problem.
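To illustrate the first point, here is a minimal simulation sketch (not the paper's actual experimental setup; all parameters are hypothetical). Two models answer the same binary items conditionally independently given each item's difficulty. A pooled-accuracy null that ignores difficulty reports substantial "excess" agreement, while an item-difficulty-aware null reports roughly none:

```python
import random

random.seed(0)

# Hypothetical setup: 1000 binary items, each with its own difficulty
# p_i = P(a model answers item i correctly). Shared difficulty induces
# agreement even when models are conditionally independent.
n_items = 1000
difficulty = [random.betavariate(2, 2) for _ in range(n_items)]

# Two models, conditionally independent given item difficulty.
model_a = [random.random() < p for p in difficulty]
model_b = [random.random() < p for p in difficulty]

def agreement(x, y):
    """Fraction of items on which the two answer vectors match."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

observed = agreement(model_a, model_b)

# Null 1: independence with a single pooled accuracy per model
# (ignores per-item difficulty).
acc_a = sum(model_a) / n_items
acc_b = sum(model_b) / n_items
null_pooled = acc_a * acc_b + (1 - acc_a) * (1 - acc_b)

# Null 2: independence conditional on each item's difficulty.
null_itemwise = sum(p * p + (1 - p) * (1 - p) for p in difficulty) / n_items

excess_pooled = observed - null_pooled      # spuriously large "excess" agreement
excess_itemwise = observed - null_itemwise  # near zero by construction
```

With the Beta(2, 2) difficulty distribution assumed here, the pooled null sits near 0.5 while expected agreement is about 0.6, so the same observed data look like strong excess agreement under one null and none under the other.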