The impressive success of recent deep neural network (DNN)-based systems owes much to the high-quality datasets used in training. However, the effects of these datasets, especially how they interact with each other, remain underexplored. We propose a state-vector framework to enable rigorous studies in this direction. The framework uses idealized probing test results as the bases of a vector space, which allows us to quantify the effects of both standalone and interacting datasets. We show that the significant effects of some commonly used language understanding datasets are characteristic and are concentrated on a few linguistic dimensions. Additionally, we observe some "spill-over" effects: the datasets could impact the models along dimensions that may seem unrelated to the intended tasks. Our state-vector framework paves the way for a systematic understanding of dataset effects, a crucial component in responsible and robust model development.
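The idea of treating probing results as coordinates can be illustrated with a minimal sketch. It assumes, purely for illustration, that a model's state is the vector of its probing scores, that a dataset's standalone effect is the change in that vector after training on the dataset, and that the interaction between two datasets is the part of their joint effect not explained by the sum of their standalone effects. The abstract does not give these formulas, and the dimension names, function names, and scores below are hypothetical.

```python
from typing import Dict, List

# Hypothetical probing dimensions; the names are illustrative only.
DIMENSIONS: List[str] = ["syntax", "semantics", "surface"]


def state_vector(probe_scores: Dict[str, float]) -> List[float]:
    """Arrange probing scores into a vector with one entry per dimension."""
    return [probe_scores[d] for d in DIMENSIONS]


def effect(after: List[float], before: List[float]) -> List[float]:
    """Assumed standalone effect: change in the state vector after training."""
    return [a - b for a, b in zip(after, before)]


def interaction_effect(
    base: List[float],
    with_x: List[float],
    with_y: List[float],
    with_both: List[float],
) -> List[float]:
    """Assumed interaction: joint effect minus the sum of standalone effects."""
    e_x = effect(with_x, base)
    e_y = effect(with_y, base)
    e_xy = effect(with_both, base)
    return [xy - x - y for xy, x, y in zip(e_xy, e_x, e_y)]


# Made-up probing scores, not results from the paper.
base = state_vector({"syntax": 0.5, "semantics": 0.5, "surface": 0.5})
with_x = state_vector({"syntax": 0.75, "semantics": 0.5, "surface": 0.5})
with_y = state_vector({"syntax": 0.5, "semantics": 0.625, "surface": 0.5})
with_both = state_vector({"syntax": 0.75, "semantics": 0.75, "surface": 0.5})

effect_x = effect(with_x, base)
interaction = interaction_effect(base, with_x, with_y, with_both)
```

In this toy setting, a nonzero entry in `interaction` marks a dimension along which the two datasets together move the model differently than their standalone effects would predict. A nonzero entry in `effect_x` on a dimension unrelated to the dataset's intended task would correspond to a "spill-over" effect.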