Large language models (LLMs) have garnered significant attention for their remarkable performance on a continuously expanding set of natural language processing tasks. However, these models have been shown to harbor inherent societal biases, or stereotypes, which can adversely affect their performance in many downstream applications. In this paper, we introduce a novel, purely prompt-based approach to uncover hidden stereotypes within an arbitrary LLM. Our approach dynamically generates a knowledge representation of the model's internal stereotypes, enabling the identification of biases encoded within the LLM's internal knowledge. By illuminating the biases present in LLMs and offering a systematic methodology for their analysis, our work contributes to advancing transparency and promoting fairness in natural language processing systems.