As LLMs are increasingly used as judges in code applications, they should be evaluated in realistic interactive settings that capture partial context and ambiguous intent. We present TRACE (Tool for Rubric Analysis in Code Evaluation), a framework that evaluates LLM judges' ability to predict human preferences and automatically extracts rubric items to reveal systematic biases in how humans and models weigh each item. Across three interaction modalities -- chat-based programming, IDE autocompletion, and instructed code editing -- we use TRACE to measure how well LLM judges align with developer preferences. Across 13 models, even the best judges underperform human annotators by 12-23%. TRACE identifies 35 significant sources of misalignment between humans and judges across the interaction modalities, the majority of which correspond to established software engineering code quality criteria. For example, in chat-based coding, judges are biased toward longer code explanations, while humans prefer shorter ones. We find significant misalignment on the majority of existing code quality dimensions, revealing alignment gaps between LLM judges and human preferences in realistic coding applications.
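To make the alignment measurement concrete, the following is a minimal sketch (not TRACE's actual interface; the data fields and the judge callback are hypothetical) of how agreement between an LLM judge and human pairwise preferences could be scored:

```python
# Hypothetical sketch: fraction of pairwise comparisons on which an LLM judge's
# verdict matches the aggregated human preference label.
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Comparison:
    prompt: str             # developer request / interaction context
    response_a: str         # candidate completion, edit, or chat answer A
    response_b: str         # candidate completion, edit, or chat answer B
    human_prefers_a: bool   # aggregated human annotation for this pair


def judge_agreement(
    items: Iterable[Comparison],
    judge_prefers_a: Callable[[str, str, str], bool],
) -> float:
    """Return the share of comparisons where the judge matches the human label."""
    items = list(items)
    if not items:
        return 0.0
    matches = sum(
        judge_prefers_a(c.prompt, c.response_a, c.response_b) == c.human_prefers_a
        for c in items
    )
    return matches / len(items)
```

Under this kind of setup, the paper's headline numbers correspond to comparing `judge_agreement` for each model against the agreement achieved by held-out human annotators on the same pairs.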