Large Language Models (LLMs) excel at language understanding and at generating human-level text. However, even with supervised training and human alignment, these LLMs remain susceptible to adversarial attacks in which malicious users prompt the model to generate undesirable text. LLMs also inherently encode potential biases that can cause various harmful effects during interactions. Bias evaluation metrics lack standards and consensus, and existing methods often rely on human-generated templates and annotations, which are expensive and labor-intensive. In this work, we train models to automatically create adversarial prompts that elicit biased responses from target LLMs. We present LLM-based bias evaluation metrics and also analyze several existing automatic evaluation methods and metrics. We analyze the various nuances of model responses, identify the strengths and weaknesses of model families, and assess where evaluation methods fall short. We compare these metrics to human evaluation and validate that the LLM-as-a-Judge metric aligns with human judgement on bias in response generation.