Execution path reasoning is a key step towards understanding program semantics. It is crucial for generating test cases that cover specific branches or paths, and for detecting bugs that are triggered only along certain paths, without actually executing the program. Traditionally, execution path reasoning is achieved with symbolic execution, but existing SMT-based symbolic execution approaches struggle with complex data structures and external API calls. This challenge is even more pronounced in languages with highly flexible syntax, such as Python, which has resulted in a lack of widely adopted tools for reasoning about execution paths. Reasoning about execution paths with AI-based approaches is therefore a promising direction. In this paper, we investigate the feasibility of adopting large language models (LLMs) for execution path reasoning on Python, where traditional path-based symbolic execution tools are unavailable. We conduct an empirical study on two types of path reasoning tasks: generation tasks for test case generation and classification tasks for bug detection. We build new evaluation pipelines and benchmarks from both competition-level programs and real-world repositories. Our results show that state-of-the-art LLMs can reason correctly about execution paths and improve test coverage on real-world software, though models with stronger reasoning abilities do not always outperform weaker ones. These findings highlight the potential of using LLMs as a complementary heuristic for path-aware code reasoning, especially in programming languages that lack mature symbolic execution tools. We have released our benchmark and evaluation scripts at https://github.com/jacobwwh/llm-path-study.