Leveraging Large Language Models for Automated Reproduction of Networking Research Results

Yining Jiang, Yunxin Xu, Wenyun Xu, Yufan Zhu, Tangtang He, Haiying Huang, Letian Zhu, Qingyu Song, Qiang Su, Lizhao You, Lu Tang, Wanjin Feng, Yuchao Zhang, Linghe Kong, Qiao Xiang, Jiwu Shu

From arXiv; 21 pages, 8 figures, 10 tables

Code reproduction is a cornerstone of scientific validity, yet it remains a formidable challenge in computer networking research because open-source implementations are scarce and heterogeneous system architectures are complex. While Large Language Models (LLMs) have demonstrated potential in code generation, existing code generation frameworks often fail to handle the long-context constraints and intricate logical dependencies involved in reproducing network systems from academic papers. To facilitate result reproduction, we introduce \emph{RepLLM}, an end-to-end multi-agent framework designed to automate the transformation of networking research into executable code. RepLLM features a novel collaborative architecture comprising four specialized agents -- Content Parsing, Architecture Design, Code Generation, and Audit \& Repair -- coordinated through an explicit \textit{Shared Memory} mechanism to ensure global context consistency. Enhanced with Chain-of-Thought LLM reasoning and a sandbox-isolated static-dynamic debugging methodology, our framework effectively resolves semantic discrepancies and runtime errors. Extensive evaluations on representative papers from SIGCOMM and NSDI demonstrate that RepLLM significantly outperforms state-of-the-art baselines in generating compile-ready and logically correct systems. Results further demonstrate that RepLLM facilitates the reproduction of 80\% of the original benchmarks with only four hours of human intervention.
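
As a reading aid, the four-agent arrangement coordinated through Shared Memory can be pictured with a minimal Python sketch. It is not RepLLM's implementation. The class `SharedMemory`, the function `run_pipeline`, the key names, and the strictly sequential agent order are all hypothetical, and each agent is a string-manipulating stub standing in for an LLM call. The only point illustrated is that every agent reads from and writes to one shared store instead of passing context along privately.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedMemory:
    """Hypothetical shared store that every agent reads from and writes to."""
    entries: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def write(self, agent: str, key: str, value: str) -> None:
        self.entries[key] = value
        self.log.append(f"{agent} wrote {key}")


# Each agent is a stub: a real agent would prompt an LLM with the
# relevant memory entries instead of concatenating strings.
def content_parsing(mem: SharedMemory) -> None:
    mem.write("content_parsing", "spec", "parsed: " + mem.entries["paper"])


def architecture_design(mem: SharedMemory) -> None:
    mem.write("architecture_design", "design", "modules for [" + mem.entries["spec"] + "]")


def code_generation(mem: SharedMemory) -> None:
    mem.write("code_generation", "code", "code for [" + mem.entries["design"] + "]")


def audit_and_repair(mem: SharedMemory) -> None:
    # Placeholder check only; the abstract describes sandboxed
    # static and dynamic debugging, which is not modeled here.
    ok = mem.entries["code"].startswith("code for")
    mem.write("audit_and_repair", "audit", "pass" if ok else "fail")


# Assumed order: the abstract lists the agents in this sequence.
PIPELINE: List[Callable[[SharedMemory], None]] = [
    content_parsing,
    architecture_design,
    code_generation,
    audit_and_repair,
]


def run_pipeline(paper_text: str) -> SharedMemory:
    mem = SharedMemory()
    mem.write("user", "paper", paper_text)
    for agent in PIPELINE:
        agent(mem)
    return mem
```

Calling `run_pipeline("toy paper")` returns the memory object, whose `log` records which agent wrote which entry. In this sketch the shared store is what lets a later stage, such as the audit stub, consult any earlier artifact directly.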
