Retrieval Augmented Generation (RAG) has advanced software engineering tasks but remains underexplored in unit test generation. To bridge this gap, we investigate the efficacy of RAG-based unit test generation for machine learning (ML/DL) APIs and analyze the impact of different knowledge sources on its effectiveness. We examine three domain-specific sources for RAG: (1) API documentation (official guidelines), (2) GitHub issues (developer-reported resolutions), and (3) StackOverflow Q&As (community-driven solutions). Our study focuses on five widely used Python-based ML/DL libraries -- TensorFlow, PyTorch, Scikit-learn, Google JAX, and XGBoost -- targeting their most-used APIs. We evaluate four state-of-the-art LLMs -- GPT-3.5-Turbo, GPT-4o, Mistral MoE 8x22B, and Llama 3.1 405B -- across three strategies: basic instruction prompting, Basic RAG, and API-level RAG. We quantitatively assess syntactic correctness, dynamic correctness, and line coverage. While RAG does not improve correctness, it increases line coverage by 6.5% on average. We find that GitHub issues yield the largest coverage improvement, as they supply edge cases drawn from a variety of reported problems. We also find that the generated unit tests can detect new bugs: 28 bugs were detected in total, 24 unique bugs were reported to developers, of which 10 were confirmed, 4 were rejected, and 10 are awaiting developers' confirmation. Our findings highlight RAG's potential to improve test coverage in unit test generation when paired with well-targeted knowledge sources. Future work should focus on retrieval techniques that identify documents exercising unique program states to further optimize RAG-based unit test generation.
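To make the RAG strategies concrete, the following is a minimal, illustrative sketch of a Basic RAG step for test generation: retrieve the knowledge-source documents most similar to an API query, then assemble them into a prompt for an LLM. This is an assumption-laden toy, not the paper's actual pipeline; the function names (`retrieve`, `build_prompt`), the tiny corpus, and the bag-of-words cosine similarity (real systems typically use dense embeddings) are all hypothetical.

```python
# Toy Basic-RAG retrieval for unit test generation (illustrative only).
from collections import Counter
import math

def cosine(a_tokens, b_tokens):
    """Bag-of-words cosine similarity between two token lists."""
    ca, cb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, corpus, k=2):
    """Return the k documents most similar to the query."""
    q = query.lower().split()
    ranked = sorted(corpus,
                    key=lambda d: cosine(q, d.lower().split()),
                    reverse=True)
    return ranked[:k]

def build_prompt(api_name, docs):
    """Assemble retrieved context plus an instruction for the LLM."""
    ctx = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{ctx}\n\nWrite a unit test for {api_name}."

# Hypothetical knowledge-source snippets (e.g., from API docs or issues).
corpus = [
    "tf.reshape changes the tensor shape and raises an error on size mismatch",
    "torch.matmul supports broadcasting of batch dimensions",
    "tf.reshape with -1 infers one dimension automatically",
]
docs = retrieve("tf.reshape shape mismatch", corpus)
prompt = build_prompt("tf.reshape", docs)
```

In a real pipeline the prompt would be sent to one of the evaluated LLMs; here the point is only that the knowledge source determines which edge cases (e.g., size-mismatch behavior) reach the model's context.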