The HuggingFace Datasets Hub hosts thousands of datasets, offering exciting opportunities for language model training and evaluation. However, datasets for a given task type are stored with different schemas, and harmonization is harder than it seems (https://xkcd.com/927/). Multi-task training or evaluation therefore requires manual work to fit data into task templates. Several initiatives independently address this problem by releasing harmonized datasets or harmonization code that preprocesses datasets into a common format. We identify patterns across previous preprocessing efforts, such as mapping column names and extracting a specific sub-field from structured data in a column, and propose a structured annotation framework that keeps our annotations fully exposed rather than buried in unstructured code. We release a dataset annotation framework and dataset annotations for more than 400 English tasks (https://github.com/sileod/tasksource). For every dataset, these annotations provide metadata, such as the names of the columns that should be used as inputs or labels, and can save time in future dataset preprocessing, even for users who do not adopt our framework. We fine-tune a multi-task text encoder on all tasksource tasks, outperforming every publicly available text encoder of comparable size on an external evaluation (https://hf.co/sileod/deberta-v3-base-tasksource-nli).
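The two preprocessing patterns named above, column-name mapping and sub-field extraction, can be illustrated with a minimal sketch. This is not the tasksource API: the `resolve` and `harmonize` helpers, the target field names (`sentence1`, `sentence2`, `labels`), and the toy rows are all assumptions made for illustration. The point is only that an annotation can be plain declarative data (a target-to-source mapping) instead of dataset-specific preprocessing code.

```python
from typing import Any, Callable, Dict, List, Union

# Hypothetical annotation format: each target field maps either to a source
# column name (column-name mapping) or to a function of the row
# (sub-field extraction from structured data).
FieldSpec = Union[str, Callable[[Dict[str, Any]], Any]]
Annotation = Dict[str, FieldSpec]


def resolve(spec: FieldSpec, row: Dict[str, Any]) -> Any:
    """Return the value a field spec selects from one row."""
    return spec(row) if callable(spec) else row[spec]


def harmonize(rows: List[Dict[str, Any]], annotation: Annotation) -> List[Dict[str, Any]]:
    """Rewrite every row into the shared schema described by the annotation."""
    return [
        {target: resolve(spec, row) for target, spec in annotation.items()}
        for row in rows
    ]


# Toy rows standing in for two datasets with different schemas.
nli_rows = [{"premise": "A man sleeps.", "hypothesis": "A person rests.", "gold": 0}]
qa_rows = [{
    "question": "Who wrote the note?",
    "context": "Ada wrote the note.",
    "answers": {"text": ["Ada"], "answer_start": [0]},
}]

# Pattern 1: pure column-name mapping.
nli_annotation: Annotation = {
    "sentence1": "premise",
    "sentence2": "hypothesis",
    "labels": "gold",
}

# Pattern 2: extraction of a sub-field from a structured column.
qa_annotation: Annotation = {
    "sentence1": "question",
    "sentence2": "context",
    "labels": lambda row: row["answers"]["text"][0],
}

harmonized_nli = harmonize(nli_rows, nli_annotation)
harmonized_qa = harmonize(qa_rows, qa_annotation)
```

After harmonization, both toy datasets expose the same three fields, so a single task template could consume either one. Because the annotations are ordinary data, they could in principle be read and reused without running the helpers in this sketch.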