Large Language Models (LLMs) are increasingly used for coding tasks, and adapting them to programmers' personal needs requires accounting for subjective coding preferences. Existing work overlooks such preferences and focuses mainly on code correctness. In this study, we propose a typology of four subjective coding preference axes (complexity, commenting, modularity, and readability), motivated by common engineering habits and validated by 25 software engineers. We collect a dataset of ~3,000 paired Python code snippets reflecting these axes, annotated by 73 experts who rate their preferences on a Likert scale. Using our dataset, we study how LLMs handle subjective coding preferences. We present 13 LLMs with pairs of solutions to the same programming task, first as textual descriptions and then as concrete code snippets. We find that models often prefer one option when it is described in natural language but the opposite option when evaluating the code itself. More consistent models (i.e., those whose choices agree between words and deeds) frequently reveal positional bias: swapping the order of the options changes the preferred alternative. We then use the five most consistent models to re-annotate the dataset. Compared to humans, models show polarized Likert distributions and notable divergence in ratings. A case study on GPT-5 reveals reliance on external assumptions and brittle reasoning.
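The two checks described above (agreement between choices on descriptions and on code, and sensitivity to option order) can be illustrated with a minimal sketch. This is not the study's implementation: the `Judge` type, the function names, and the toy judges are hypothetical stand-ins for a call to an LLM that returns the index of the option it prefers.

```python
from typing import Callable, Tuple

# Hypothetical judge interface (an assumption, not from the paper):
# receives two options in presentation order and returns 0 or 1,
# the index of the option it prefers.
Judge = Callable[[str, str], int]


def preferred_item(judge: Judge, first: str, second: str) -> str:
    """Return the option the judge prefers for this presentation order."""
    return (first, second)[judge(first, second)]


def has_positional_bias(judge: Judge, a: str, b: str) -> bool:
    """True if swapping the presentation order changes the preferred option."""
    return preferred_item(judge, a, b) != preferred_item(judge, b, a)


def words_match_deeds(
    judge: Judge,
    descriptions: Tuple[str, str],
    snippets: Tuple[str, str],
) -> bool:
    """True if the judge picks the same side for descriptions and for code."""
    return judge(*descriptions) == judge(*snippets)


# Toy judges used only for illustration.
def always_first(first: str, second: str) -> int:
    return 0  # ignores content entirely: a purely positional choice


def prefers_shorter(first: str, second: str) -> int:
    return 0 if len(first) <= len(second) else 1  # content-based choice


if __name__ == "__main__":
    short_code = "total = sum(xs)"
    long_code = "total = 0\nfor x in xs:\n    total += x"
    print(has_positional_bias(always_first, short_code, long_code))     # True
    print(has_positional_bias(prefers_shorter, short_code, long_code))  # False
```

In this sketch a judge is flagged as positionally biased on a pair when the two presentation orders yield different winners; aggregating that flag over many pairs would give a bias rate per model.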