Large Language Models (LLMs) have become valuable assistants for developers in code-related tasks. While LLMs excel at traditional programming tasks such as code generation and bug fixing, they struggle with visually oriented coding tasks and often produce code with suboptimal aesthetics. In this paper, we introduce a new pipeline to enhance the aesthetic quality of LLM-generated code. We first construct AesCode-358K, a large-scale instruction-tuning dataset focused on code aesthetics. Next, we propose agentic reward feedback, a multi-agent system that evaluates the executability, static aesthetics, and interactive aesthetics of generated code. Building on this, we develop GRPO-AR, which integrates these signals into the GRPO algorithm for joint optimization of functionality and code aesthetics. Finally, we introduce OpenDesign, a benchmark for assessing code aesthetics. Experimental results show that combining supervised fine-tuning on AesCode-358K with reinforcement learning using agentic reward feedback significantly improves performance on OpenDesign and also enhances results on existing benchmarks such as PandasPlotBench. Notably, our AesCoder-4B surpasses GPT-4o and GPT-4.1, and achieves performance comparable to large open-source models with 480B-685B parameters, underscoring the effectiveness of our approach.
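To make the GRPO-AR idea concrete, the following is a minimal sketch of how multiple agentic reward signals could be folded into a scalar reward and turned into GRPO-style group-relative advantages. The weights `W_EXEC`, `W_STATIC`, and `W_INTERACT` and the helper names are illustrative assumptions, not the paper's actual implementation, which the abstract does not specify.

```python
import numpy as np

# Hypothetical weights for the three agentic feedback signals
# (executability, static aesthetics, interactive aesthetics);
# the real weighting scheme used by GRPO-AR is not given in the abstract.
W_EXEC, W_STATIC, W_INTERACT = 0.4, 0.3, 0.3

def combined_reward(executable: float, static_score: float, interactive_score: float) -> float:
    """Fold the three agentic reward signals into one scalar reward."""
    return W_EXEC * executable + W_STATIC * static_score + W_INTERACT * interactive_score

def group_relative_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO-style advantage: normalize each sample's reward within its group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: a group of four sampled completions for one prompt.
rewards = [
    combined_reward(1.0, 0.8, 0.7),
    combined_reward(1.0, 0.5, 0.4),
    combined_reward(0.0, 0.2, 0.1),  # non-executable sample gets no executability credit
    combined_reward(1.0, 0.9, 0.9),
]
print(group_relative_advantages(rewards))
```

In this sketch, the group-relative normalization is the standard GRPO construction; only the composition of the reward from executability and aesthetic scores reflects the agentic-reward-feedback idea described above.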