Hallucination in code generation models hinders their applicability to systems with high safety standards. One critical bottleneck in mitigating code hallucination is the difficulty of identifying the functional correctness of generated code, due to its unnatural form. We address this bottleneck by automatically generating unit tests with dynamic code analysis tools, leveraging the \emph{executable nature} of code. Building on this, we propose a \emph{selective code generator} that abstains from uncertain generations -- based on the functional correctness evaluated by the generated unit tests -- to theoretically control the rate of incorrect answers among non-abstained ones, \ie the false discovery rate. Finally, we propose to use generated unit tests not only in learning but also in evaluation for precise code assessment, calling this paradigm \emph{FuzzEval}. We demonstrate the efficacy of our method, showing controllable code hallucination and reasonable selection efficiency.
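To make the selection mechanism concrete, the following is a minimal, hypothetical sketch of abstention based on unit-test pass rates. The function names (`selective_generate`, `make_test`), the candidate pool, and the threshold `tau` are illustrative assumptions, not the paper's actual algorithm; in particular, the paper calibrates the threshold to control the false discovery rate, which this toy snippet does not do.

```python
def selective_generate(candidates, unit_tests, tau=0.9):
    """Return the first candidate whose pass rate on the generated
    unit tests reaches the threshold tau; otherwise abstain (None)."""
    for code in candidates:
        passed = sum(1 for test in unit_tests if test(code))
        if passed / len(unit_tests) >= tau:
            return code
    return None  # abstain: no candidate is confident enough

# Toy example: two candidate implementations of "add two numbers";
# the generated unit tests check outputs on sample inputs.
candidates = [
    "def add(a, b): return a - b",   # buggy candidate
    "def add(a, b): return a + b",   # correct candidate
]

def make_test(a, b, expected):
    """Build a unit test that executes a candidate and checks one case."""
    def test(src):
        ns = {}
        exec(src, ns)
        try:
            return ns["add"](a, b) == expected
        except Exception:
            return False
    return test

tests = [make_test(1, 2, 3), make_test(0, 0, 0), make_test(-1, 1, 0)]
chosen = selective_generate(candidates, tests)
```

Here the buggy candidate passes only one of three tests and is rejected, while the correct one passes all three and is returned; with no candidate above `tau`, the generator abstains instead of risking a hallucinated answer.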