Human-Aligned Enhancement of Programming Answers with LLMs Guided by User Feedback

Large Language Models (LLMs) are widely used to support software developers in tasks such as code generation, optimization, and documentation. However, their ability to improve existing programming answers in a human-like manner remains underexplored. On technical question-and-answer platforms such as Stack Overflow (SO), contributors often revise answers in response to user comments that identify errors, inefficiencies, or missing explanations. Yet roughly one-third of this feedback is never addressed because of limited time, expertise, or visibility, leaving many answers incomplete or outdated. This study investigates whether LLMs can enhance programming answers by interpreting and incorporating comment-based feedback. We make four main contributions. First, we introduce ReSOlve, a benchmark of 790 SO answers with their associated comment threads, annotated to distinguish improvement-related feedback from general feedback. Second, we evaluate four state-of-the-art LLMs on their ability to identify actionable concerns and find that DeepSeek achieves the best balance between precision and recall. Third, we present AUTOCOMBAT, an LLM-powered tool that improves programming answers by jointly leveraging user comments and question context. Evaluated against human-revised references, AUTOCOMBAT produces improvements of near-human quality, preserves the original intent of each answer, and significantly outperforms the baseline. Finally, a user study with 58 practitioners shows strong practical value: 84.5% of participants indicated that they would adopt or recommend the tool. Overall, AUTOCOMBAT demonstrates the potential of scalable, feedback-driven answer refinement to improve the reliability and trustworthiness of technical knowledge platforms.
