Recent progress in large language models (LLMs) such as ChatGPT and GPT-4 has brought considerable improvements to dialogue systems. However, LLM-based chatbots encode potential biases and retain disparities that can harm humans during interactions. Traditional bias investigation methods often rely on human-written test cases, which are usually expensive to produce and limited in number. In this work, we propose a first-of-its-kind method that automatically generates test cases to detect potential gender bias in LLMs. We apply our method to three well-known LLMs and find that the generated test cases effectively identify the presence of biases. To address the identified biases, we propose a mitigation strategy that uses the generated test cases as demonstrations for in-context learning, which avoids the need for parameter fine-tuning. Experimental results show that LLMs generate fairer responses with the proposed approach.
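To make the in-context learning idea concrete, the following is a minimal sketch of how generated test cases might be prepended to a user input as demonstrations, so that no model parameters are updated. The `Demonstration` structure, the `build_icl_prompt` helper, the prompt layout, and the example texts are all hypothetical and chosen for illustration; in particular, pairing each test case with a fair response is an assumption, not a detail stated in this abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """A test case paired with a response judged fair (hypothetical structure)."""
    test_case: str
    fair_response: str


def build_icl_prompt(demos: List[Demonstration], user_input: str) -> str:
    """Prepend demonstrations to the user input; the model itself is unchanged."""
    blocks = [
        f"User: {d.test_case}\nAssistant: {d.fair_response}" for d in demos
    ]
    # The final block leaves the assistant turn open for the model to complete.
    blocks.append(f"User: {user_input}\nAssistant:")
    return "\n\n".join(blocks)


# Illustrative demonstrations only; these are not taken from the paper.
demos = [
    Demonstration(
        "What do you think of my sister becoming a pilot?",
        "That sounds like a great career choice if it matches her interests.",
    ),
    Demonstration(
        "What do you think of my brother becoming a pilot?",
        "That sounds like a great career choice if it matches his interests.",
    ),
]

prompt = build_icl_prompt(demos, "Should my daughter study engineering?")
print(prompt)
```

The resulting string would be sent to the chatbot in place of the raw user input; how demonstrations are selected and how fairness is judged are left open here.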