Large language models (LLMs) can generate programs that pass unit tests, but passing tests does not guarantee reliable runtime behavior. We find that different correct solutions to the same task can show very different memory and performance patterns, which can lead to hidden operational risks. We present a framework to measure execution-time memory stability across multiple correct generations. At the solution level, we introduce Dynamic Mean Pairwise Distance (DMPD), which uses Dynamic Time Warping to compare the shapes of memory-usage traces after converting them into Monotonic Peak Profiles (MPPs) to reduce transient noise. Aggregating DMPD across tasks yields a model-level Model Instability Score (MIS). Experiments on BigOBench and CodeContests show substantial runtime divergence among correct solutions. Instability often increases with higher sampling temperature even when pass@1 improves. We also observe correlations between our stability measures and software engineering indicators such as cognitive and cyclomatic complexity, suggesting links between operational behavior and maintainability. Our results support stability-aware selection among passing candidates in CI/CD to reduce operational risk without sacrificing correctness. Artifacts are available.
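To make the solution-level metric concrete, the following is a minimal sketch of how DMPD could be computed. The function names, the pure-Python DTW, and the running-maximum construction of the Monotonic Peak Profile are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of DMPD: mean pairwise DTW distance between the
# Monotonic Peak Profiles (MPPs) of memory traces from correct solutions.
from itertools import combinations

def monotonic_peak_profile(trace):
    # Running maximum of the trace: suppresses transient dips so that
    # only the growth of the memory footprint shapes the comparison.
    mpp, peak = [], float("-inf")
    for v in trace:
        peak = max(peak, v)
        mpp.append(peak)
    return mpp

def dtw_distance(a, b):
    # Classic O(len(a) * len(b)) dynamic-time-warping distance,
    # tolerant to traces of different lengths and local time shifts.
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def dmpd(traces):
    # DMPD for one task: average DTW distance over all pairs of MPPs.
    mpps = [monotonic_peak_profile(t) for t in traces]
    pairs = list(combinations(mpps, 2))
    return sum(dtw_distance(a, b) for a, b in pairs) / len(pairs)
```

Under this sketch, identical traces give a DMPD of zero, and divergent memory-growth shapes increase it; averaging per-task DMPD values across a benchmark would then yield a model-level score in the spirit of MIS.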