Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 24 challenging and realistic tasks, 8 of which are new, across four formats: classification, regression, ranking and search. We then use this benchmark to study and improve the generalization ability of scientific document representation models. We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters and find they outperform the existing single-embedding state-of-the-art by over 2 points absolute. We release the resulting family of multi-format models, called SPECTER2, for the community to use and build on.
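To make the multi-embedding idea concrete, the following is a minimal sketch of how task-format-specific control codes could be attached to a document before encoding. It is an illustration under stated assumptions, not the released SPECTER2 implementation: the token strings, the `[SEP]` separator, and the helper `build_inputs` are hypothetical, and the encoder itself is omitted. The only thing taken from the text above is that there is one variant per task format (classification, regression, ranking, search).

```python
from typing import Dict

# Hypothetical control codes, one per task format named in the abstract.
# The actual token strings used by any released model may differ.
CONTROL_CODES: Dict[str, str] = {
    "classification": "[CLF]",
    "regression": "[RGN]",
    "ranking": "[RNK]",
    "search": "[SRCH]",
}


def build_inputs(title: str, abstract: str) -> Dict[str, str]:
    """Return one encoder input string per task format.

    Each string is the same document text prefixed with a different
    control code, so a shared encoder could produce a different
    embedding for each format.
    """
    # Assumed input layout: title and abstract joined by a separator token.
    text = f"{title} [SEP] {abstract}"
    return {fmt: f"{code} {text}" for fmt, code in CONTROL_CODES.items()}


inputs = build_inputs(
    "SciRepEval",
    "A benchmark for scientific document representations.",
)
for fmt, text in inputs.items():
    print(fmt, "->", text)
```

A downstream task would then pick the embedding that matches its format, for example the `"search"` variant for a retrieval task, instead of reusing a single embedding everywhere.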