Robust, faithful and harm-free pronoun use for individuals is an important goal for language models as their use increases, but prior work tends to study only one or two of these characteristics at a time. To measure progress towards the combined goal, we introduce the task of pronoun fidelity: given a context introducing a co-referring entity and pronoun, the task is to reuse the correct pronoun later. We present RUFF, a carefully-designed dataset of over 5 million instances to measure robust pronoun fidelity in English, and we evaluate 37 popular large language models across architectures (encoder-only, decoder-only and encoder-decoder) and scales (11M-70B parameters). When an individual is introduced with a pronoun, models can mostly faithfully reuse this pronoun in the next sentence, but they are significantly worse with she/her/her, singular they and neopronouns. Moreover, models are easily distracted by non-adversarial sentences discussing other people; even one additional sentence with a distractor pronoun causes accuracy to drop by 34% on average. Our results show that pronoun fidelity is neither robust, nor due to reasoning, in a simple, naturalistic setting where humans achieve nearly 100% accuracy. We encourage researchers to bridge the gaps we find and to carefully evaluate reasoning in settings where superficial repetition might inflate perceptions of model performance.
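The task setup described above can be sketched in code. The following is a minimal, hypothetical illustration of a pronoun-fidelity instance and its exact-match scoring; the field names, example sentences, and schema are assumptions for illustration, not the actual RUFF dataset format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PronounFidelityInstance:
    # Context sentence introducing a co-referring entity and its pronoun.
    introduction: str
    # Non-adversarial sentences about other people, each carrying a
    # distractor pronoun (the abstract reports that even one such
    # sentence sharply reduces accuracy).
    distractors: List[str] = field(default_factory=list)
    # Cloze-style test sentence where the model must reuse the pronoun.
    test_template: str = ""
    gold_pronoun: str = ""

    def prompt(self) -> str:
        """Concatenate introduction, distractors, and test sentence."""
        return " ".join([self.introduction, *self.distractors, self.test_template])

def accuracy(predictions: List[str], instances: List[PronounFidelityInstance]) -> float:
    """Fraction of instances where the predicted pronoun matches the gold one."""
    correct = sum(p == inst.gold_pronoun for p, inst in zip(predictions, instances))
    return correct / len(instances)

# Hypothetical example: one distractor sentence with a competing pronoun.
ex = PronounFidelityInstance(
    introduction="The accountant finished the audit because she was meticulous.",
    distractors=["The janitor said he would lock up."],
    test_template="Later, the accountant said ___ would file the report.",
    gold_pronoun="she",
)
print(ex.prompt())
print(accuracy(["she"], [ex]))  # 1.0 when the model reuses the correct pronoun
```

A faithful model fills the blank with "she" despite the intervening "he"; scoring is plain exact match over the pronoun slot.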