Artificial intelligence (AI) tools based on large language models have achieved human-level performance on some computer programming tasks. We report several experiments using GPT-4 to generate computer code. These experiments demonstrate that AI code generation using the current generation of tools, while powerful, requires substantial human validation to ensure accurate performance. We also demonstrate that GPT-4 refactoring of existing code can significantly improve that code along several established metrics for code quality. We further show that GPT-4 can generate tests with substantial coverage, but that many of these tests fail when applied to the associated code. These findings suggest that while AI coding tools are very powerful, they still require humans in the loop to ensure the validity and accuracy of the results.