The HuggingFace Datasets Hub hosts thousands of datasets, offering exciting opportunities for language model training and evaluation. However, datasets for a given task type often have different schemas, which makes harmonization difficult: multi-task training or evaluation requires manual work to fit data into task templates. Several initiatives have independently addressed this issue by releasing harmonized datasets or by providing harmonization code that preprocesses datasets into a consistent format. We identify recurring patterns across these preprocessing efforts, such as column name mapping and the extraction of specific sub-fields from structured data in a column. We then propose a structured annotation framework that keeps annotations fully exposed rather than hidden within unstructured code. We release this framework together with dataset annotations for more than 500 English tasks\footnote{\url{https://github.com/sileod/tasksource}}. These annotations include metadata, such as the names of the columns to be used as inputs or labels for each dataset, which can save time in future dataset preprocessing whether or not our framework is used. We fine-tune a multi-task text encoder on all tasksource tasks; it outperforms every publicly available text encoder of comparable size in an external evaluation.
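To make the two preprocessing patterns named in the abstract concrete (column name mapping and sub-field extraction), the following is a minimal sketch of what a declarative annotation could look like. It is not the tasksource API: the class `ClassificationAnnotation`, the helper `get`, the template field names, and the example rows are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Union

# A field spec is either a source column name (plain column mapping)
# or a function of the raw row (sub-field extraction).
FieldSpec = Union[str, Callable[[Dict[str, Any]], Any]]


def get(path: str) -> Callable[[Dict[str, Any]], Any]:
    """Hypothetical helper: build an extractor for a dotted path such as 'answers.0'."""
    keys = path.split(".")

    def extractor(row: Dict[str, Any]) -> Any:
        value: Any = row
        for key in keys:
            # Integer-like keys index into lists; other keys index into dicts.
            value = value[int(key)] if isinstance(value, list) else value[key]
        return value

    return extractor


@dataclass(frozen=True)
class ClassificationAnnotation:
    """Hypothetical task template: declares where each template field comes from."""
    sentence1: FieldSpec
    sentence2: FieldSpec
    labels: FieldSpec

    def apply(self, row: Dict[str, Any]) -> Dict[str, Any]:
        """Map one raw dataset row onto the shared template."""
        out = {}
        for name in ("sentence1", "sentence2", "labels"):
            spec = getattr(self, name)
            out[name] = spec(row) if callable(spec) else row[spec]
        return out


# Pattern 1: column name mapping only.
nli_annotation = ClassificationAnnotation(
    sentence1="premise", sentence2="hypothesis", labels="gold_label"
)

# Pattern 2: extracting sub-fields from structured columns.
nested_annotation = ClassificationAnnotation(
    sentence1="context", sentence2=get("question.text"), labels=get("answers.0")
)

row_a = {"premise": "A dog runs.", "hypothesis": "An animal moves.", "gold_label": "entailment"}
row_b = {"context": "Paris is in France.", "question": {"text": "Is Paris in France?"}, "answers": ["yes", "true"]}

harmonized_a = nli_annotation.apply(row_a)
harmonized_b = nested_annotation.apply(row_b)
```

Because each annotation in this sketch is plain data plus small extractors, the column names it records stay inspectable and can be reused by other preprocessing code, which is the property the abstract attributes to structured annotations.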