In this study, we address the issue of API hallucinations in various software engineering contexts. We introduce CloudAPIBench, a new benchmark designed to measure API hallucination occurrences. CloudAPIBench also provides annotations for the frequencies of API occurrences in the public domain, allowing us to study API hallucinations at various frequency levels. Our findings reveal that Code LLMs struggle with low frequency APIs: e.g., GPT-4o achieves only 38.58% valid low frequency API invocations. We demonstrate that Documentation Augmented Generation (DAG) significantly improves performance for low frequency APIs (rising to 47.94% with DAG) but hurts high frequency APIs when using sub-optimal retrievers (a 39.02% absolute drop). To mitigate this, we propose triggering DAG intelligently: we check the generated API against an API index, or leverage Code LLMs' confidence scores, to retrieve documentation only when needed. We demonstrate that our proposed methods improve the balance between low and high frequency API performance, resulting in more reliable API invocations (8.20% absolute improvement on CloudAPIBench for GPT-4o).
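The selective triggering idea above can be sketched as a simple gate. This is a minimal illustration, not the paper's implementation: the function name, the set-based API index, and the confidence threshold value are all assumptions made for the example.

```python
def should_retrieve_docs(api_name: str,
                         api_index: set[str],
                         confidence: float,
                         threshold: float = 0.9) -> bool:
    """Decide whether to trigger Documentation Augmented Generation.

    Retrieval fires only when needed:
      1) the generated API is absent from a known-API index
         (a likely hallucination), or
      2) the model's confidence in the generated API tokens
         falls below a threshold.
    """
    if api_name not in api_index:
        return True  # unknown API: retrieve documentation
    return confidence < threshold  # known API but low confidence


# Known API, high confidence: skip retrieval, keeping high
# frequency API performance intact.
print(should_retrieve_docs("s3.create_bucket", {"s3.create_bucket"}, 0.97))

# Unrecognized API name: retrieve documentation.
print(should_retrieve_docs("s3.make_bucket", {"s3.create_bucket"}, 0.97))
```

The point of the gate is asymmetry: high frequency APIs that the model already knows bypass retrieval (avoiding the drop caused by noisy retrieved context), while rare or unrecognized APIs fall back to documentation.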