In this project, we explore the emerging field of prompt engineering and apply it to the downstream task of detecting biases in language models (LMs). Concretely, we study how to design prompts that reveal four types of bias: (1) gender, (2) race, (3) sexual orientation, and (4) religion. We experiment with manually crafted prompts designed to draw out subtle biases that may be present in a language model, and we apply these prompts to multiple variants of three widely used models, BERT, RoBERTa, and T5, to evaluate their biases. We provide a comparative analysis of these models and assess them with a two-fold method: human judgment to decide whether a model's predictions are biased, and model-level judgment (through further prompts) to determine whether a model can self-diagnose the biases in its own predictions.