Large language models (LLMs) have been widely deployed in coding tasks, drawing increasing attention to the evaluation of the quality and safety of their outputs. However, research on bias in code generation remains limited. Existing studies typically identify bias by applying malicious prompts or by reusing tasks and datasets originally designed for discriminative models. Because these prior datasets are not fully optimized for code-related tasks, there is a pressing need for benchmarks specifically designed to evaluate code models. In this study, we introduce FairCoder, a novel benchmark for evaluating social bias in code generation. FairCoder examines bias along the software development pipeline, from function implementation to unit testing, across diverse real-world scenarios. Additionally, we design three metrics to assess fairness performance on this benchmark. We conduct experiments on widely used LLMs and provide a comprehensive analysis of the results. The findings reveal that all tested LLMs exhibit social bias.