In this technical report, we evaluate the performance of the ChatGPT and GPT-3 models on the task of detecting vulnerabilities in code. We conducted the evaluation on our own real-world dataset, using binary and multi-label classification tasks over Common Weakness Enumeration (CWE) vulnerabilities. We chose to evaluate the model because it has shown good performance on other code-based tasks, such as solving programming challenges and understanding code at a high level. However, we found that the ChatGPT model performed no better than a dummy classifier on both the binary and the multi-label classification tasks for code vulnerability detection.
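To make the comparison with a dummy classifier concrete, the following is a minimal sketch of how such a baseline can be computed for the binary task. It assumes a majority-class strategy and the F1 score as the metric; this section does not state which dummy strategy or metric the report used. The labels and model outputs are hypothetical toy values, not data from our evaluation, and the helper names `majority_baseline` and `f1_score` are illustrative.

```python
from collections import Counter


def majority_baseline(train_labels, n):
    """Predict the most frequent training label for each of n samples."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    return [most_common_label] * n


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels (1 = vulnerable, 0 = not vulnerable).
y_train = [1, 1, 1, 0, 1, 0, 1, 1]
y_test = [1, 0, 1, 1, 0, 1, 1, 0]
# Hypothetical predictions standing in for a language model's answers.
model_pred = [1, 1, 0, 1, 1, 1, 0, 0]

# The dummy baseline ignores the code entirely and always predicts
# the majority class seen in the training labels.
baseline_pred = majority_baseline(y_train, len(y_test))
baseline_f1 = f1_score(y_test, baseline_pred)
model_f1 = f1_score(y_test, model_pred)

print(f"baseline F1 = {baseline_f1:.3f}, model F1 = {model_f1:.3f}")
```

In this toy setup the baseline scores higher than the hypothetical model, which shows why a dummy classifier is a useful reference point: on an imbalanced dataset, a score that looks reasonable in isolation may reflect nothing more than the class distribution.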