Large Language Models (LLMs) are increasingly used to optimize code efficiency. Evaluating their effectiveness and suggesting further optimization opportunities often rely on high-quality tests that expose the performance bottlenecks present in a program. However, existing approaches depend on a limited set of hand-curated inputs or on uninteresting, length-stressing tests generated by LLMs, failing to reveal more nuanced optimization opportunities. We present WEDGE, a framework for generating performance-stressing inputs for a given program under test. WEDGE synthesizes explicit performance-characterizing constraints in the form of branch conditions that partition the program's execution space into performance-specific regions. When integrated with a coverage-guided fuzzer, reaching different regions provides explicit rewards that steer test generation toward inefficient implementations. Our evaluation shows that WEDGE induces significantly greater slowdowns than the tests in CodeContests and than tests claimed to be optimized by existing approaches. From a utility perspective, integrating our tests substantially improves existing code optimization approaches that rely on test-driven execution feedback. We release PERFFORGE, the performance tests generated by WEDGE, to benchmark future approaches for efficient code generation, at https://github.com/UChiSeclab/perfforge.
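To make the mechanism concrete, the following minimal Python sketch (our own hypothetical illustration, not the WEDGE implementation; the program, constraint, and search loop are all invented for exposition) shows how a performance-characterizing constraint expressed as a branch condition can mark a performance-specific input region, and how reaching that region can be turned into an explicit reward for a coverage- or reward-guided test generator.

```python
import random


def solve(xs):
    """Toy program under test: a quadratic duplicate count."""
    count = 0
    for i in range(len(xs)):
        for j in range(len(xs)):
            if xs[i] == xs[j] and i != j:
                count += 1
    return count


def perf_constraint(xs):
    """Hypothetical performance-characterizing constraint (branch condition):
    long inputs with many repeated values make the inner comparison fire
    often, exposing the O(n^2) behavior that a plain length-stressing
    test of distinct values would not stress as hard."""
    return len(xs) > 100 and len(set(xs)) < len(xs) // 10


def fuzz_reward(xs):
    """Explicit reward granted when a generated input reaches the
    performance-specific region defined by the constraint."""
    return 1.0 if perf_constraint(xs) else 0.0


if __name__ == "__main__":
    # Naive random search standing in for a coverage-guided fuzzer:
    # keep the candidate input with the highest reward.
    best, best_reward = [], -1.0
    for _ in range(1000):
        candidate = [random.randint(0, 5) for _ in range(random.randint(1, 200))]
        r = fuzz_reward(candidate)
        if r > best_reward:
            best, best_reward = candidate, r
    print("reward:", best_reward, "input length:", len(best))
```

In a real setup, the constraint would be synthesized per program and the reward would be combined with coverage feedback rather than used alone; this sketch only illustrates the partition-and-reward idea described above.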