Generation Probabilities Are Not Enough: Exploring the Effectiveness of Uncertainty Highlighting in AI-Powered Code Completions

Large-scale generative models have enabled the development of AI-powered code completion tools that assist programmers in writing code. However, much like other AI-powered tools, AI-powered code completions are not always accurate, and they can introduce bugs or even security vulnerabilities into code if a human programmer does not detect and correct them. One technique that has been proposed and implemented to help programmers identify potential errors is to highlight uncertain tokens. However, no empirical studies have explored the effectiveness of this technique or investigated the different and not-yet-agreed-upon notions of uncertainty in the context of generative models. We explore the question of whether conveying information about uncertainty enables programmers to produce code more quickly and accurately when collaborating with an AI-powered code completion tool, and if so, what measure of uncertainty best fits programmers' needs. Through a mixed-methods study with 30 programmers, we compare three conditions: providing the AI system's code completion alone, highlighting tokens with the lowest likelihood of being generated by the underlying generative model, and highlighting tokens with the highest predicted likelihood of being edited by a programmer. We find that highlighting tokens with the highest predicted likelihood of being edited leads to faster task completion and more targeted edits, and is subjectively preferred by study participants. In contrast, highlighting tokens according to their probability of being generated does not provide any benefit over the baseline with no highlighting. We further explore the design space of how to convey uncertainty in AI-powered code completion tools, and find that programmers prefer highlights that are granular, informative, interpretable, and not overwhelming.
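To make the two highlighting conditions concrete, the following is a minimal sketch, not the study's implementation. It assumes each token of a completion already carries two per-token scores: a generation probability from the underlying model and a predicted edit probability from some separate predictor. The function names, the `[[...]]` marker, the choice of a fixed `k`, and all numeric scores are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def lowest_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k smallest scores, returned in token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:k])


def highest_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k largest scores, returned in token order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def render(tokens: Sequence[str], highlighted: Sequence[int]) -> str:
    """Join tokens, wrapping highlighted ones in [[...]] (a stand-in for UI highlighting)."""
    marked = set(highlighted)
    return "".join(f"[[{t}]]" if i in marked else t for i, t in enumerate(tokens))


# Hypothetical completion and made-up scores (assumptions, not data from the study).
tokens = ["return", " ", "a", " ", "-", " ", "b"]
generation_prob = [0.99, 0.98, 0.40, 0.97, 0.90, 0.96, 0.85]
edit_prob = [0.01, 0.01, 0.05, 0.01, 0.80, 0.01, 0.10]

# Condition 2: highlight the token the model was least likely to generate.
by_generation = render(tokens, lowest_k_indices(generation_prob, k=1))

# Condition 3: highlight the token most likely to be edited by a programmer.
by_edit = render(tokens, highlighted=highest_k_indices(edit_prob, k=1))

print(by_generation)
print(by_edit)
```

In this toy case the two signals point at different tokens: the generation-probability highlight falls on the identifier `a`, while the edit-probability highlight falls on the operator `-`. That is the kind of divergence the comparison between the two conditions is about.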
