Code readability is crucial for software comprehension and maintenance, yet difficult to assess at scale. Traditional static metrics often fail to capture the subjective, context-sensitive nature of human judgments. Large Language Models (LLMs) offer a scalable alternative, but their behavior as readability evaluators remains underexplored. We introduce CoReEval, the first large-scale benchmark for evaluating LLM-based code readability assessment, comprising over 1.4 million model-snippet-prompt evaluations across 10 state-of-the-art LLMs. The benchmark spans 3 programming languages (Java, Python, CUDA), 2 code types (functional code and unit tests), 4 prompting strategies (zero-shot, few-shot, chain-of-thought, and tree-of-thought), 9 decoding settings, and developer-guided prompts tailored to junior and senior developer personas. We compare LLM outputs against human annotations and a validated static model, analyzing numerical alignment (MAE, Pearson's r, Spearman's ρ) and justification quality (sentiment, aspect coverage, semantic clustering). Our findings show that developer-guided prompting grounded in human-defined readability dimensions improves alignment in structured contexts, enhances explanation quality, and enables lightweight personalization through persona framing. However, increased score variability highlights trade-offs between alignment, stability, and interpretability. CoReEval provides a robust foundation for prompt engineering, model alignment studies, and human-in-the-loop evaluation, with applications in education, onboarding, and CI/CD pipelines where LLMs can serve as explainable, adaptable reviewers.
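The numerical-alignment metrics named above (MAE, Pearson's r, Spearman's ρ) can be illustrated with a minimal sketch, assuming LLM scores and human annotations are paired per snippet; the function name, variable names, and the example ratings below are illustrative assumptions, not part of the CoReEval artifact.

```python
# Minimal sketch: alignment metrics between LLM readability scores and human
# annotations. Assumes paired score arrays; the 1-5 scale shown in the usage
# example is a hypothetical illustration.
import numpy as np
from scipy.stats import pearsonr, spearmanr

def alignment_metrics(llm_scores, human_scores):
    llm = np.asarray(llm_scores, dtype=float)
    human = np.asarray(human_scores, dtype=float)
    mae = np.mean(np.abs(llm - human))   # mean absolute error
    r, _ = pearsonr(llm, human)          # linear correlation
    rho, _ = spearmanr(llm, human)       # rank correlation
    return {"MAE": mae, "Pearson": r, "Spearman": rho}

# Hypothetical usage with four snippets rated on a 1-5 scale
print(alignment_metrics([3.5, 4.0, 2.0, 4.5], [3.0, 4.0, 2.5, 5.0]))
```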