As Large Language Models (LLMs) increasingly generate code in software development, ensuring the quality of LLM-generated code has become critical. Traditional testing approaches using Example-based Testing (EBT) often miss edge cases: defects that occur at boundary values, special input patterns, or extreme conditions. This research investigates the characteristics of LLM-generated Property-based Testing (PBT) compared to EBT for exploring edge cases. We analyze 16 HumanEval problems where standard solutions fail on extended test cases, generating both PBT and EBT test code with Claude-4-sonnet. Our experimental results reveal that while each method individually achieved a 68.75\% bug detection rate, combining both approaches improved detection to 81.25\%. The analysis demonstrates complementary characteristics: PBT excels at detecting performance issues and edge cases through extensive input-space exploration, whereas EBT excels at catching specific boundary conditions and special patterns. These findings suggest that a hybrid approach leveraging both testing methods can improve the reliability of LLM-generated code, offering guidance for test generation strategies in LLM-based code generation.
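To make the PBT/EBT contrast concrete, the sketch below pairs an EBT suite of hand-picked examples with a Hypothesis-based PBT suite for a function modeled on HumanEval's sort_even task. This is a minimal illustration under stated assumptions: the implementation, the chosen properties, and the test names are our own, not artifacts from the paper's experiments.

```python
# Illustrative sketch (assumed, not from the paper's artifacts): EBT vs. PBT
# for a sort_even-style task. Requires the `hypothesis` package.
from hypothesis import given, strategies as st

def sort_even(lst):
    """Return lst with the values at even indices sorted, odd indices untouched."""
    result = list(lst)
    result[::2] = sorted(lst[::2])
    return result

# EBT: hand-picked examples pin down specific boundary conditions.
def test_sort_even_examples():
    assert sort_even([]) == []                 # empty input
    assert sort_even([5]) == [5]               # single element
    assert sort_even([3, 1, 2]) == [2, 1, 3]   # odd-length list

# PBT: Hypothesis explores the input space and checks invariants that
# must hold for every generated list.
@given(st.lists(st.integers()))
def test_sort_even_properties(lst):
    out = sort_even(lst)
    assert len(out) == len(lst)                # length is preserved
    assert out[1::2] == lst[1::2]              # odd indices are untouched
    assert out[::2] == sorted(lst[::2])        # even indices come out sorted
```

The example-based test fixes three concrete boundary inputs, while the property-based test delegates input construction to Hypothesis and asserts structural invariants, which is how PBT reaches inputs that hand-written examples tend to miss.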