Large language models have shown remarkable ability in serial code generation, but they still struggle with parallel code, for which training data is comparatively scarce. A common remedy is to use coding agents that interact with external tools, but tool calls can be costly and are sometimes impractical, e.g., for partially written code. We propose Parallel-Code World Models (PCWMs), reasoning LLMs that aim to predict tool outcomes directly from parallel source code. To train PCWMs, we design a novel exploration and data generation pipeline that samples diverse parallel-coding problems and candidate implementations across multiple domains, then executes them via tools to record data races and performance profiles. From these records, we synthesize reasoning traces that causally connect source code to observed tool outcomes. Fine-tuning on the resulting data yields noticeable gains: a 7B-parameter world model improves from 64.3% to 72.8% accuracy in race-outcome prediction, and an 8B-parameter model improves from 49.3% to 58.6% accuracy on a performance-profiling task. Furthermore, when open-weight models are tasked with fixing data races, world-model feedback improves their race-fixing rates relative to self-feedback by 2.7%-9.1% using our 7B-parameter world model and by 6.1%-11.1% using our 14B-parameter world model. Our results suggest that reasoning world models have the potential to serve alongside external tool calls in parallel-coding agents.