Large Language Models (LLMs) excel at code generation but struggle with complex problems. Retrieval-Augmented Generation (RAG) mitigates this issue by integrating external knowledge, yet retrieval models often miss relevant context, and generation models hallucinate when given irrelevant data. We propose the Programming Knowledge Graph (PKG) for semantic representation and fine-grained retrieval of code and text. Our approach improves retrieval precision through tree pruning and mitigates hallucinations via a re-ranking mechanism that integrates non-RAG solutions. Structuring external data into finer-grained nodes further improves retrieval granularity. Evaluations on HumanEval and MBPP show pass@1 accuracy gains of up to 20%, including a 34% improvement over baselines on MBPP. Our findings demonstrate that the proposed PKG approach, combined with the re-ranker, effectively addresses complex problems while having minimal negative impact on solutions that are already correct without RAG. The replication package is published at https://github.com/iamshahd/ProgrammingKnowledgeGraph