As language models continue to be integrated into applications of personal and societal relevance, ensuring their trustworthiness is crucial, particularly their ability to produce consistent outputs regardless of sensitive attributes. Because first names may serve as proxies for (intersectional) socio-demographic representations, it is imperative to examine how first names affect commonsense reasoning capabilities. In this paper, we study whether a model's reasoning on a given input differs depending on the first name it contains. Our underlying assumption is that reasoning about Alice should not differ from reasoning about James. We propose and implement a controlled experimental framework to measure the causal effect of first names on commonsense reasoning, enabling us to distinguish model predictions that arise by chance from those caused by the factors of interest. Our results indicate that the frequency of first names has a direct effect on model predictions, with less frequent names yielding predictions that diverge from those for more frequent names. To gain insight into the internal mechanisms that contribute to these behaviors, we also conduct an in-depth explainability analysis. Overall, our findings suggest that, to ensure model robustness, it is essential to augment datasets with more diverse first names during the configuration stage.
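The idea of separating chance variation from a genuine name effect can be illustrated with a simple permutation test. The sketch below is not the framework proposed in the paper; it is a minimal illustration under stated assumptions. The helper names (`fill_template`, `permutation_test`), the `{name}` template slot, and the toy outcome lists are all hypothetical: each outcome is assumed to be 1 when a model's prediction for a name-substituted input agrees with a reference prediction, and 0 otherwise.

```python
import random


def fill_template(template: str, name: str) -> str:
    """Insert a first name into a template with a {name} slot (hypothetical format)."""
    return template.format(name=name)


def permutation_test(frequent, rare, n_permutations=10_000, seed=0):
    """Return (observed difference in mean outcome, two-sided p-value).

    Group labels are shuffled repeatedly to estimate how often a difference
    at least as large as the observed one would arise by chance alone.
    """
    rng = random.Random(seed)
    n_frequent = len(frequent)
    observed = sum(frequent) / n_frequent - sum(rare) / len(rare)

    pooled = list(frequent) + list(rare)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        group_a = pooled[:n_frequent]
        group_b = pooled[n_frequent:]
        diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy, made-up outcomes (1 = prediction agrees with the reference, 0 = it does not).
frequent_name_outcomes = [1] * 18 + [0] * 2
rare_name_outcomes = [1] * 11 + [0] * 9

difference, p_value = permutation_test(frequent_name_outcomes, rare_name_outcomes)
print(f"difference in agreement rate: {difference:.2f}, p-value: {p_value:.4f}")
```

A small p-value here would suggest that the gap between the two name groups is unlikely to be due to chance alone. Establishing a causal effect additionally requires that everything except the name is held fixed across the compared inputs, which is the role a controlled framework plays.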