Robust, faithful and harm-free pronoun use for individuals is an important goal for language models as their use increases, but prior work tends to study only one or two of these components at a time. To measure progress towards the combined goal, we introduce the task of pronoun use fidelity: given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later, independent of potential distractors. We present a carefully designed dataset of over 5 million instances to evaluate pronoun use fidelity in English, and we use it to evaluate 37 popular large language models across architectures (encoder-only, decoder-only and encoder-decoder) and scales (11M-70B parameters). We find that while models can mostly reuse previously specified pronouns faithfully in the absence of distractors, they are significantly worse at processing she/her/her, singular they and neopronouns. Additionally, models are not robustly faithful to pronouns, as they are easily distracted. With even one additional sentence containing a distractor pronoun, accuracy drops on average by 34%. With 5 distractor sentences, accuracy drops by 52% for decoder-only models and 13% for encoder-only models. We show that widely used large language models are still brittle, with large gaps in reasoning and in processing different pronouns in a setting that is very simple for humans, and we encourage researchers in bias and reasoning to bridge these gaps.
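As a rough illustration of the task setup described in the abstract, the sketch below builds toy instances consisting of an introduction sentence, zero or more distractor sentences, and a task sentence with a blank, then scores a predictor by accuracy. Everything in it is an assumption made for illustration: the sentence templates, the occupations, the four pronoun sets, and the names `build_instance`, `make_split`, `recency_baseline` and `accuracy` are hypothetical and are not taken from the dataset or evaluation code. The recency baseline is not one of the evaluated models; it only shows how a predictor that latches onto the most recent pronoun is faithful without distractors and fails as soon as one is added.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Callable, List

# Hypothetical possessive forms for four pronoun sets (illustrative only,
# not the pronoun inventory used in the dataset).
POSSESSIVE = {"he": "his", "she": "her", "they": "their", "xe": "xyr"}


@dataclass
class Instance:
    context: str        # introduction + distractor sentences + task sentence
    answer: str         # pronoun form that is faithful to the introduction
    n_distractors: int  # number of distractor sentences in the context


def build_instance(target: str, distractor: str, n_distractors: int) -> Instance:
    """Build one toy instance from made-up templates (assumed, not the paper's)."""
    intro = f"The accountant had just finished {POSSESSIVE[target]} lunch."
    distractors = [f"The taxpayer needed {POSSESSIVE[distractor]} coffee."] * n_distractors
    task = "The accountant picked up ___ bag."
    context = " ".join([intro, *distractors, task])
    return Instance(context, POSSESSIVE[target], n_distractors)


def make_split(n_distractors: int) -> List[Instance]:
    """All ordered (target, distractor) pairs with different pronoun sets."""
    return [build_instance(t, d, n_distractors) for t, d in permutations(POSSESSIVE, 2)]


def recency_baseline(context: str) -> str:
    """Toy predictor: return the last possessive pronoun seen in the context."""
    forms = set(POSSESSIVE.values())
    tokens = [tok.strip(".,") for tok in context.split()]
    seen = [tok for tok in tokens if tok in forms]
    return seen[-1]


def accuracy(predict: Callable[[str], str], instances: List[Instance]) -> float:
    """Fraction of instances where the predicted pronoun matches the introduced one."""
    correct = sum(predict(inst.context) == inst.answer for inst in instances)
    return correct / len(instances)


if __name__ == "__main__":
    for n in (0, 1, 5):
        print(n, accuracy(recency_baseline, make_split(n)))
```

Replacing `recency_baseline` with a function that queries a language model for the blank would give per-distractor-count accuracies comparable in spirit to those reported above, although the real dataset is far larger and more varied than these two templates.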