Learned representations of scientific documents can serve as valuable input features for downstream tasks without further fine-tuning. However, existing benchmarks for evaluating these representations fail to capture the diversity of relevant tasks. In response, we introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations. It includes 25 challenging and realistic tasks, 11 of which are new, across four formats: classification, regression, ranking and search. We then use the benchmark to study and improve the generalization ability of scientific document representation models. We show that state-of-the-art models struggle to generalize across task formats, and that simple multi-task training fails to improve them. However, a new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance. We experiment with task-format-specific control codes and adapters in a multi-task setting and find that they outperform the existing single-embedding state of the art by up to 1.5 points absolute.