Real-world data usually exhibits a long-tailed distribution, with a few frequent labels and many few-shot labels. Institution name normalization is a clear example of this phenomenon: there are many institutions worldwide, and their names appear in enormous variation across the publicly available literature. In this work, we first collect a large-scale institution name normalization dataset, LoT-insts, which contains over 25k classes that exhibit a naturally long-tailed distribution. To isolate the few-shot and zero-shot learning scenarios from the massive many-shot classes, we construct our test set from four subsets: many-, medium-, and few-shot sets, as well as a zero-shot open set. We also replicate several important baseline methods on our data, ranging from search-based methods to neural network methods that use the pretrained BERT model. Further, we propose a specially pretrained, BERT-based model that shows better out-of-distribution generalization on the few-shot and zero-shot test sets. Compared to other datasets focusing on the long-tailed phenomenon, our dataset has one order of magnitude more training data than the largest existing long-tailed datasets and is naturally long-tailed rather than manually synthesized. We believe it provides an important and different scenario in which to study this problem. To the best of our knowledge, this is the first natural language dataset that focuses on long-tailed and open-set classification problems.
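The four-way test split described in the abstract can be illustrated with a minimal sketch that groups test labels by how often each label occurs in the training data. The function name `bucket_labels`, the thresholds `many_min=20` and `few_max=5`, and the example institution names are all hypothetical choices made for illustration; the abstract does not state the cutoffs actually used for LoT-insts.

```python
from collections import Counter


def bucket_labels(train_labels, test_labels, many_min=20, few_max=5):
    """Group test labels into many/medium/few/zero buckets by training frequency.

    The thresholds are illustrative assumptions, not values from the paper.
    """
    counts = Counter(train_labels)
    buckets = {"many": [], "medium": [], "few": [], "zero": []}
    for label in test_labels:
        n = counts.get(label, 0)
        if n == 0:
            # Label never seen in training: zero-shot open set.
            buckets["zero"].append(label)
        elif n <= few_max:
            buckets["few"].append(label)
        elif n < many_min:
            buckets["medium"].append(label)
        else:
            buckets["many"].append(label)
    return buckets


# Hypothetical toy data: three training classes of different sizes.
train = ["MIT"] * 25 + ["ETH Zurich"] * 10 + ["Tiny College"] * 2
test = ["MIT", "ETH Zurich", "Tiny College", "Unseen Institute"]
result = bucket_labels(train, test)
```

Evaluating each bucket separately keeps the abundant many-shot classes from masking performance on the few-shot and zero-shot cases, which is the motivation the abstract gives for the split.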