Do large language models (LLMs) exhibit sociodemographic biases, even when they decline to respond? To bypass their refusal to "speak," we study this question by probing contextualized embeddings and exploring whether such bias is encoded in their latent representations. We propose a logistic Bradley-Terry probe that predicts the word pair preferences of LLMs from the words' hidden vectors. We first validate our probe on three pair preference tasks and thirteen LLMs, where it outperforms the word embedding association test (WEAT), a standard approach for testing implicit association, with a 27% relative reduction in error rate. We also find that word pair preferences are best represented in the middle layers. Next, we transfer probes trained on harmless tasks (e.g., pick the larger number) to controversial ones (e.g., compare ethnicities) to examine biases in nationality, politics, religion, and gender. We observe substantial bias for all target classes: for instance, the Mistral model implicitly prefers Europe to Africa, Christianity to Judaism, and left-wing to right-wing politics, despite declining to answer. This suggests that instruction fine-tuning does not necessarily debias contextualized embeddings. Our codebase is at https://github.com/castorini/biasprobe.
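To make the idea of a logistic Bradley-Terry probe concrete, the following is a minimal sketch and not the implementation in the biasprobe codebase. It assumes the probe is a single linear weight vector `w`, and that the probability of preferring word a over word b is the logistic function of `w` applied to the difference of their hidden vectors. The function names, hyperparameters, and the synthetic vectors standing in for real LLM hidden states are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping a score difference to a probability."""
    return 1.0 / (1.0 + np.exp(-z))


def train_bt_probe(h_a, h_b, a_preferred, lr=0.1, epochs=500):
    """Fit w so that sigmoid(w . (h_a - h_b)) approximates P(a preferred over b).

    h_a, h_b: arrays of shape (n_pairs, dim) holding the two words' vectors.
    a_preferred: array of shape (n_pairs,), 1.0 if a is preferred, else 0.0.
    Hypothetical helper: plain gradient descent on the logistic loss.
    """
    diff = h_a - h_b
    w = np.zeros(diff.shape[1])
    for _ in range(epochs):
        p = sigmoid(diff @ w)
        # Gradient of the mean negative log-likelihood with respect to w.
        grad = diff.T @ (p - a_preferred) / len(a_preferred)
        w -= lr * grad
    return w


def predict_preference(w, h_a, h_b):
    """Probability that a is preferred over b under the fitted probe."""
    return sigmoid((h_a - h_b) @ w)


# Synthetic stand-in for hidden vectors (assumption: real use would extract
# these from a chosen LLM layer). Labels come from a hidden linear direction.
rng = np.random.default_rng(0)
dim, n_pairs = 16, 400
true_direction = rng.normal(size=dim)
h_a = rng.normal(size=(n_pairs, dim))
h_b = rng.normal(size=(n_pairs, dim))
labels = ((h_a - h_b) @ true_direction > 0).astype(float)

# Train on the first 300 pairs, evaluate on the remaining 100.
w = train_bt_probe(h_a[:300], h_b[:300], labels[:300])
held_out = predict_preference(w, h_a[300:], h_b[300:])
accuracy = np.mean((held_out > 0.5) == (labels[300:] > 0.5))
print(f"held-out pair accuracy: {accuracy:.2f}")
```

Under this formulation the probe is antisymmetric by construction: swapping the two words gives the complementary probability. Transferring the probe to a new task would then amount to calling `predict_preference` with the trained `w` on hidden vectors from that task's word pairs.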