Large language models (LLMs) have rapidly gained popularity and are being embedded into professional applications due to their ability to generate human-like content. However, unquestioning reliance on their outputs and recommendations is problematic because LLMs can reinforce societal biases and stereotypes. This study investigates how LLMs, specifically OpenAI's GPT-4 and Microsoft Copilot, can reinforce gender and racial stereotypes within the software engineering (SE) profession through both textual and graphical outputs. We used each LLM to generate 300 profiles, consisting of 100 gender-based and 50 gender-neutral profiles, for a recruitment scenario in SE roles. Recommendations were generated for each profile and evaluated against the job requirements of four distinct SE positions. Each LLM was asked to select the top 5 candidates, and subsequently the best candidate, for each role. Each LLM was also asked to generate images for the top 5 candidates, yielding a dataset for analysing potential biases in both text-based selections and visual representations. Our analysis reveals that both models preferred male and Caucasian profiles, particularly for senior roles, and favoured images featuring lighter skin tones, slimmer body types, and younger appearances. These findings highlight how underlying societal biases influence the outputs of LLMs, contributing to narrow, exclusionary stereotypes that can further limit diversity and perpetuate inequities in the SE field. As LLMs are increasingly adopted within SE research and professional practice, awareness of these biases is crucial to prevent the reinforcement of discriminatory norms and to ensure that AI tools are leveraged to promote, rather than hinder, an inclusive and equitable engineering culture.
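To make the shortlisting step concrete, the sketch below shows one way such a candidate-selection prompt could be issued to GPT-4 via the OpenAI Python SDK. This is a minimal illustration, not the study's actual code: the job description, the profile texts, and the `shortlist` helper are hypothetical placeholders introduced here for exposition.

```python
# Minimal sketch (assumed setup, not the authors' pipeline) of asking an LLM
# to shortlist candidates for one SE role. Requires the `openai` package and
# an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical job description; the study used four distinct SE positions.
SENIOR_ROLE = "Senior Software Engineer: 8+ years' experience, distributed systems, team leadership."

def shortlist(profiles: list[str], job_description: str, k: int = 5) -> str:
    """Ask the model to pick the top-k candidates for a role, then the single best."""
    prompt = (
        f"Job description:\n{job_description}\n\n"
        "Candidate profiles:\n" + "\n---\n".join(profiles) + "\n\n"
        f"Select the top {k} candidates for this role against the job "
        "requirements, then name the single best candidate. Briefly justify each choice."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage with placeholder profiles:
# print(shortlist(["Profile 1: ...", "Profile 2: ..."], SENIOR_ROLE))
```

Repeating this call per role, and logging which generated profiles are shortlisted, would produce the kind of text-based selection data the study analyses for gender and racial skew.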