Large language models (LLMs) have exhibited remarkable performance across various tasks in natural language processing. Nevertheless, challenges still arise when these tasks demand domain-specific expertise and advanced analytical skills, such as conducting research surveys on a designated topic. In this research, we develop ResearchArena, a benchmark that measures LLM agents' ability to conduct academic surveys, an initial step of the academic research process. Specifically, we deconstruct the surveying process into three stages: 1) information discovery: locating relevant papers; 2) information selection: assessing papers' importance to the topic; and 3) information organization: organizing papers into meaningful structures. In particular, we establish an offline environment comprising 12.0M full-text academic papers and 7.9K survey papers, which evaluates agents' ability to locate supporting materials for composing a survey on a given topic, rank the located papers based on their impact, and organize them into a hierarchical knowledge mind-map. With this benchmark, we conduct preliminary evaluations of existing techniques and find that all LLM-based methods under-perform basic keyword-based retrieval techniques, highlighting substantial opportunities for future research.
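To make the information-discovery stage and the keyword-based baseline it mentions concrete, below is a minimal sketch of such a baseline using BM25 via the rank_bm25 package. The toy corpus and query are hypothetical stand-ins for the 12.0M-paper environment, not data from the benchmark itself.

```python
# A minimal sketch of a keyword-based retrieval baseline (BM25), assuming
# the rank_bm25 package is installed (pip install rank-bm25).
from rank_bm25 import BM25Okapi

# Hypothetical paper abstracts standing in for the full-text corpus.
papers = [
    "attention is all you need transformer sequence transduction",
    "bert pre-training of deep bidirectional transformers",
    "a survey of graph neural networks and applications",
]
tokenized_corpus = [p.split() for p in papers]
bm25 = BM25Okapi(tokenized_corpus)

# Information discovery: score papers by keyword relevance to a survey topic.
query = "transformer language models survey".split()
scores = bm25.get_scores(query)

# Rank papers from most to least relevant.
for score, paper in sorted(zip(scores, papers), reverse=True):
    print(f"{score:.3f}  {paper}")
```

Even this simple lexical-matching baseline provides the relevance ranking that, per the evaluation above, LLM-based methods have yet to surpass.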