Progress toward autonomous software engineering is currently bottlenecked by a shortage of diverse, large-scale trajectory data. We address this by introducing \ourdataset, a dataset of 207,489 agentic trajectories spanning nine programming languages (Python, Go, TypeScript, JavaScript, Rust, Java, PHP, C, and C++). The trajectories are generated from 20,000 real-world pull requests using the OpenHands and SWE-agent harnesses. The source tasks are drawn from SWE-rebench-V2 and filtered for permissive licenses (MIT, Apache, BSD). The dataset uses a hybrid-reasoning synthesis: MiniMax-M2.5 generates trajectories with explicit "thinking" processes, while Qwen3.5-122B provides high-quality "non-thinking" traces. Together, these trajectories support the training of models capable of long-horizon reasoning. We validate the dataset by fine-tuning the Qwen3-30B-A3B series (Thinking, Instruct, and Coder). The best-performing model achieves resolve rates of 61.7% on SWE-bench Verified, 57.1% on SWE-bench Multilingual, and 36.8% on SWE-bench Pro. These results establish Open-SWE-Traces as a strong resource for distilling software engineering capabilities into efficient, open-source agentic LLMs.