Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. However, validating the correctness of LLM-generated code remains a critical and largely unsolved challenge. Existing approaches rely either on dynamic consensus across multiple code candidates, which makes them costly and difficult to scale, or on static reasoning, which is susceptible to dynamic bugs and order bias. In this paper, we propose TRAILS (Targeted Reasoning Agreement via Inputs and Specifications), an approach that grounds LLM reasoning in concrete (input, output) pairs. TRAILS first generates diverse test inputs via category partitioning based on the specification, then executes the candidate code on those inputs and prompts LLMs to assess whether the resulting input-output pairs conform to the specification, without ever reasoning over the code itself. Scores are aggregated across inputs to determine whether the program is likely correct. We evaluate TRAILS on two datasets, LiveCodeBench and CoCoClaNeL, across three LLMs (Qwen3Coder-30B, Devstral-Small-24B, and Olmo3.1-Instruct), comparing against HoarePrompt and a Zero-Shot Chain-of-Thought (CoT) baseline. TRAILS improves the Matthews Correlation Coefficient by up to 39% relative to Zero-Shot CoT and consistently outperforms HoarePrompt. Beyond accuracy, TRAILS demonstrates greater stability across seeded runs, reducing sensitivity to LLM non-determinism, and assigns correct labels to a larger set of unique code samples than competing approaches.
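
The judge-and-aggregate loop described above can be illustrated with a minimal Python sketch. This is not the TRAILS implementation: the abstract does not specify the aggregation rule, the score range, or the prompts, so the mean-with-threshold rule, the `[0, 1]` score range, and every name below (`assess_program`, `Verdict`, `run_candidate`, `judge_pair`, `toy_judge`, `buggy_abs`) are hypothetical. In particular, `toy_judge` is a deterministic stand-in for the LLM call, used only so the example runs offline.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable, List, Tuple


@dataclass
class Verdict:
    test_input: str
    output: str
    score: float  # assumed range [0, 1]: how well the pair conforms to the spec


def assess_program(
    spec: str,
    inputs: Iterable[str],
    run_candidate: Callable[[str], str],
    judge_pair: Callable[[str, str, str], float],
    threshold: float = 0.8,  # assumed cut-off, not taken from the paper
) -> Tuple[bool, List[Verdict]]:
    """Run the candidate on each input, judge each (input, output) pair
    against the specification, and aggregate the per-input scores."""
    verdicts: List[Verdict] = []
    for test_input in inputs:
        output = run_candidate(test_input)            # execute the candidate code
        score = judge_pair(spec, test_input, output)  # the judge never sees the code
        verdicts.append(Verdict(test_input, output, score))
    if not verdicts:
        raise ValueError("at least one test input is required")
    # Assumed aggregation: arithmetic mean compared against a fixed threshold.
    aggregate = mean(v.score for v in verdicts)
    return aggregate >= threshold, verdicts


def toy_judge(spec: str, test_input: str, output: str) -> float:
    """Stand-in for an LLM judge; it hard-codes the 'absolute value' spec."""
    return 1.0 if int(output) == abs(int(test_input)) else 0.0


def buggy_abs(test_input: str) -> str:
    """Candidate with a bug: it returns negative inputs unchanged."""
    return str(int(test_input))
```

With one hand-picked input per category (negative, zero, positive), for example `assess_program("Return the absolute value of an integer.", ["-3", "0", "7"], buggy_abs, toy_judge)`, only the negative input exposes the bug, so the mean score falls below the assumed threshold and the candidate is labeled as likely incorrect. In a real setting the inputs would come from category partitioning over the specification, and `judge_pair` would wrap an LLM prompt.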