LLMs are increasingly embedded in programming workflows, from code generation to automated code review. Yet how gendered communication styles interact with LLM-assisted programming and code review remains underexplored. We present a mixed-methods pilot study examining whether gender-related linguistic differences in prompts influence code generation outcomes and code review decisions. Across three complementary studies, we analyze (i) a collection of real-world coding prompts, (ii) a controlled user study in which developers solve identical programming tasks with LLM assistance, and (iii) an LLM-based simulated evaluation framework that systematically varies gender-coded prompt styles and reviewer personas. We find that gender-related differences in prompting style are subtle but measurable: female-authored prompts exhibit more indirect and involved language, but these differences do not translate into consistent gaps in functional correctness or static code quality. For LLM code review, in contrast, we observe systematic biases: on average, models approve female-authored code more often, despite comparable quality. Controlled experiments show that gender-coded prompt styles affect code length and maintainability, while reviewer behavior varies across models. Our findings suggest that fairness risks in LLM-assisted programming arise less from generation accuracy than from LLM-based evaluation, a growing concern as LLMs are increasingly deployed as automated code reviewers.