The ever-increasing volume of paper submissions makes it difficult to stay informed about the latest state-of-the-art research. To address this challenge, we introduce LEGOBench, a benchmark for evaluating systems that generate scientific leaderboards. LEGOBench is curated from 22 years of preprint submission data on arXiv and more than 11k machine learning leaderboards on the PapersWithCode portal. We present four graph-based and two language model-based leaderboard generation task configurations. We evaluate popular encoder-only scientific language models as well as decoder-only large language models across these task configurations. State-of-the-art models exhibit significant performance gaps in automatic leaderboard generation on LEGOBench. The code is available on GitHub (https://github.com/lingo-iitgn/LEGOBench) and the dataset is hosted on OSF (https://osf.io/9v2py/?view_only=6f91b0b510df498ba01595f8f278f94c).