tasksource: A Dataset Harmonization Framework for Streamlined NLP Multi-Task Learning and Evaluation

The HuggingFace Datasets Hub hosts thousands of datasets, offering extensive opportunities for language model training and evaluation. However, datasets for the same task type often use different schemas, which makes harmonization difficult: multi-task training or evaluation requires manual work to fit each dataset into a task template. Several initiatives have tackled this issue independently, either by releasing harmonized datasets or by providing harmonization code that preprocesses datasets into a consistent format. We identify patterns across these previous preprocessing efforts, such as column name mapping and extracting specific sub-fields from structured data in a column. We then propose a structured annotation framework that keeps our annotations fully exposed rather than hidden within unstructured code. We release a dataset annotation framework and dataset annotations for more than 500 English tasks (https://github.com/sileod/tasksource). These annotations include metadata, such as the names of the columns to use as inputs or labels for every dataset, which can save time in future dataset preprocessing whether or not our framework is used. We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size in an external evaluation.
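The two preprocessing patterns named in the abstract, column name mapping and sub-field extraction, can be illustrated with a small declarative annotation. The sketch below is not the tasksource API: the `ClassificationAnnotation` class, its field names (`sentence1`, `sentence2`, `labels`), and the sample rows are all hypothetical and chosen only to show how an annotation can remain inspectable data rather than ad hoc preprocessing code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Union

Row = Dict[str, Any]
# A source is either a column name or a function that extracts a value from a row.
Source = Union[str, Callable[[Row], Any]]


@dataclass(frozen=True)
class ClassificationAnnotation:
    """Hypothetical annotation: states which source feeds each template field."""
    sentence1: Source
    labels: Source
    sentence2: Optional[Source] = None  # single-sentence tasks leave this unset

    def apply(self, row: Row) -> Row:
        """Project one raw row onto the shared template."""
        out: Row = {}
        for name in ("sentence1", "sentence2", "labels"):
            source = getattr(self, name)
            if source is None:
                continue
            # Callable sources handle sub-field extraction; strings are plain renames.
            out[name] = source(row) if callable(source) else row[source]
        return out


# Pattern 1: column name mapping (made-up NLI-style row).
nli_rows = [{"premise": "A man sleeps.", "hypothesis": "A person is resting.", "gold": 0}]
nli = ClassificationAnnotation(sentence1="premise", sentence2="hypothesis", labels="gold")

# Pattern 2: extracting a sub-field from structured data in a column (made-up row).
review_rows = [{"text": "Great film!", "annotations": {"sentiment": "positive", "annotator": 3}}]
review = ClassificationAnnotation(
    sentence1="text",
    labels=lambda row: row["annotations"]["sentiment"],
)

# Both datasets now share one schema and can be concatenated for multi-task use.
harmonized = [nli.apply(r) for r in nli_rows] + [review.apply(r) for r in review_rows]
```

Because each annotation is a plain object, its column names can be read or reused without running the preprocessing itself, which is the property the abstract attributes to structured annotations.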

