Detecting vulnerabilities in source code remains a critical yet challenging task, especially when benign and vulnerable functions share significant similarities. In this work, we introduce VulTrial, a courtroom-inspired multi-agent framework designed to enhance automated vulnerability detection. It employs four role-specific agents: a security researcher, a code author, a moderator, and a review board. Through extensive experiments using GPT-3.5 and GPT-4o, we demonstrate that VulTrial outperforms both single-agent and multi-agent baselines. Using GPT-4o, VulTrial improves performance by 102.39% and 84.17% over its respective baselines. Additionally, we show that role-specific instruction tuning in the multi-agent setting with a small amount of data (50 pair samples) further improves VulTrial's performance by 139.89% and 118.30%. Furthermore, we analyze the impact of increasing the number of agent interactions on VulTrial's overall performance. While multi-agent setups inherently incur higher costs due to increased token usage, our findings reveal that applying VulTrial to a cost-effective model such as GPT-3.5 can improve performance by 69.89% compared to GPT-4o in a single-agent setting, at a lower overall cost.
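To make the courtroom analogy concrete, the sketch below illustrates how the four roles named in the abstract could be orchestrated in a single debate round. The prompts, orchestration order, and the `query_llm` stub are illustrative assumptions rather than VulTrial's actual implementation; only the four role names come from the abstract.

```python
# Illustrative sketch of a courtroom-style multi-agent pipeline, assuming a
# generic chat-completion backend (e.g., GPT-3.5 / GPT-4o). All prompts and
# the orchestration order are hypothetical, not VulTrial's actual design.

def query_llm(role_prompt: str, context: str) -> str:
    """Placeholder for a chat-completion API call; returns canned text here."""
    return f"[{role_prompt[:35]}...] -> given: {context[:40]}..."

ROLE_PROMPTS = {
    "security_researcher": "Argue why this function may be vulnerable.",    # prosecution-like role
    "code_author":         "Defend the function and rebut the claims.",     # defense-like role
    "moderator":           "Summarize both sides' arguments neutrally.",    # judge-like role
    "review_board":        "Issue a final verdict: vulnerable or benign.",  # jury-like role
}

def courtroom_round(function_source: str, rounds: int = 1) -> str:
    """Run researcher/author exchanges, then moderate and return a verdict."""
    transcript = f"Function under review:\n{function_source}"
    for _ in range(rounds):
        accusation = query_llm(ROLE_PROMPTS["security_researcher"], transcript)
        defense = query_llm(ROLE_PROMPTS["code_author"], accusation)
        transcript += f"\nResearcher: {accusation}\nAuthor: {defense}"
    summary = query_llm(ROLE_PROMPTS["moderator"], transcript)
    return query_llm(ROLE_PROMPTS["review_board"], summary)

if __name__ == "__main__":
    demo_fn = "int copy(char *dst, const char *src) { strcpy(dst, src); return 0; }"
    print(courtroom_round(demo_fn, rounds=1))
```

In this reading, increasing the number of agent interactions corresponds to raising `rounds`, which lengthens the debate transcript and hence the token cost per verdict.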