In software engineering research, the primary outcome is frequently a tool. However, for practitioners and academics alike, it is hard to tell which tools are maintained and whether they work out of the box. In this paper, we propose a pipeline that identifies relevant studies with LLM screening, extracts the tools presented in them, and runs them with an LLM-based coding agent. To evaluate the feasibility of our approach, we focus on software log anomaly detection tools. We begin by designing a broad search string that yields 3233 hits from Scopus. We then ask two LLMs to provide an inclusion probability for each title-abstract pair according to our inclusion and exclusion criteria. This screening reduced the 3233 exported abstracts to 569 included papers, of which we could download 470. These papers contained 206 unique links, and after manual evaluation we determined 83 of them to be tools. Finally, we ran the LLM-based coding agent on these 83 links, which got 24 of the tools running successfully. Replicating our approach would require only about 4 hours of human effort, of which 3 hours were manual PDF downloading, and 12 hours of LLM running time, which suggests that LLMs can make rapid reviews efficient. Because practitioner-built tools often lack academic papers, we aim to expand our analysis to tool-hosting platforms such as GitHub and PyPI. We also plan to formalize our workflow as LLM Agent Skills to make our approach easier to adopt.
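The funnel reported above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the counts stated in the abstract; the names `FUNNEL` and `stage_ratios` are illustrative and not part of the authors' pipeline. Note that the step from downloaded papers to unique links changes the unit of analysis (papers to links), so that ratio is links per paper rather than a retention rate.

```python
# Counts as reported in the abstract, in pipeline order.
# Stage labels are illustrative wording, not the authors' terminology.
FUNNEL = [
    ("abstracts exported from Scopus", 3233),
    ("papers included by LLM screening", 569),
    ("papers downloaded", 470),
    ("unique links extracted", 206),   # unit changes here: papers -> links
    ("links judged to be tools", 83),
    ("tools run successfully by the agent", 24),
]


def stage_ratios(funnel):
    """Return (stage, count, ratio_to_previous_stage) for each stage.

    The ratio of the first stage is None because it has no predecessor.
    """
    rows = []
    previous = None
    for name, count in funnel:
        ratio = None if previous is None else count / previous
        rows.append((name, count, ratio))
        previous = count
    return rows


if __name__ == "__main__":
    for name, count, ratio in stage_ratios(FUNNEL):
        shown = "" if ratio is None else f"  ({ratio:.1%} of previous stage)"
        print(f"{count:>5}  {name}{shown}")
```

Read this way, the two most selective steps are the LLM screening of abstracts and the final execution step, where fewer than a third of the identified tools ran successfully.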