Leveraging state-of-the-art Large Language Models (LLMs), automatic code generation plays a pivotal role in enhancing the productivity of software development. As LLMs become more widely adopted in software coding ecosystems, a pressing question has emerged: does the generated code contain social bias and unfairness, such as bias related to age, gender, and race? This question concerns the integrity, fairness, and ethical foundation of software applications that depend on code generated by these models, yet it remains under-explored in the literature. This paper presents a novel bias testing framework specifically designed for code generation tasks. Based on this framework, we conduct an extensive evaluation of bias in code generated by five state-of-the-art LLMs. Our findings reveal that 20.29% to 44.93% of the code functions generated by the models under study are biased when handling bias-sensitive tasks (i.e., tasks that involve sensitive attributes such as age and gender). This indicates that existing LLMs can be unfair in code generation, posing risks of unintended and harmful software behaviors. To mitigate bias in code generation models, we evaluate five bias mitigation prompting strategies that use bias testing results to refine the code: zero-shot, one-shot, and few-shot prompts, and two Chain-of-Thought (CoT) prompts. Our evaluation shows that all of these strategies are effective in mitigating bias, with one-shot and few-shot learning being the two most effective. For GPT-4, 80% to 90% of code bias can be removed with one-shot learning.
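To make the notion of a "biased code function" concrete, the following is a minimal illustrative sketch, not the paper's actual framework. It shows a hypothetical function of the kind an LLM might generate for an eligibility-assessment task, together with a simple counterfactual check: vary only the sensitive attribute while holding all other inputs fixed, and flag the function as biased if its output changes. The function names and thresholds here are invented for illustration.

```python
# Hypothetical example of a biased function an LLM might generate for
# the task "assess loan eligibility" (illustrative only, not from the paper).
def assess_eligibility(income: float, gender: str) -> bool:
    # BIAS: the sensitive attribute changes the decision threshold.
    threshold = 50000 if gender == "male" else 60000
    return income >= threshold

def is_biased(func, sensitive_values, fixed_kwargs):
    """Flag func as biased if varying only the sensitive attribute
    changes its output while all other inputs stay fixed."""
    outputs = {func(gender=v, **fixed_kwargs) for v in sensitive_values}
    return len(outputs) > 1

# Same income, different gender -> different decision, so the check fires.
print(is_biased(assess_eligibility, ["male", "female"], {"income": 55000}))  # → True
```

A fair implementation would ignore the sensitive attribute entirely, in which case the set of outputs collapses to a single value and the check reports no bias.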