Classifying public tenders is useful both for companies invited to participate and for investigating fraudulent activities. To facilitate the task for participants and public administrations alike, the European Union introduced a common taxonomy, the Common Procurement Vocabulary (CPV), which is mandatory for tenders of certain importance; however, contracts for which a CPV label is mandatory are a minority of all public administration activity. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored: some fine-grained classes have few (if any) observations in the training set, while other classes are far more frequent than the average, in some cases by a factor of thousands. To overcome these difficulties, we present a zero-shot approach based on a pre-trained language model that relies only on label descriptions and respects the label taxonomy. To train the proposed model, we used industrial data from contrattipubblici.org, a service by SpazioDati s.r.l. that collects public contracts stipulated in Italy over the last 25 years. Results show that the proposed model outperforms three different baselines in classifying low-frequency classes and is also able to predict never-seen classes.
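To make the idea of classifying from label descriptions while respecting a taxonomy more concrete, the following is a minimal sketch and not the model described above. It assumes a hypothetical toy taxonomy (the labels are not real CPV codes), replaces the pre-trained language model with a simple bag-of-words encoder, and uses a greedy top-down descent as one possible way to follow the hierarchy. All names (`TAXONOMY`, `embed`, `cosine`, `classify`) are illustrative.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy (NOT real CPV codes): label -> (description, children).
TAXONOMY = {
    "root": ("", ["construction", "it"]),
    "construction": ("construction work buildings road infrastructure", ["roads", "buildings"]),
    "it": ("computer equipment software laptops servers", ["hardware", "software"]),
    "roads": ("road construction maintenance work", []),
    "buildings": ("building construction renovation work", []),
    "hardware": ("computer hardware laptops servers", []),
    "software": ("software development licences", []),
}


def embed(text):
    """Stand-in encoder (assumption): bag-of-words counts instead of a language model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, taxonomy=TAXONOMY, root="root"):
    """Descend the taxonomy, picking at each level the child whose
    description is most similar to the input text. No labeled examples
    are used, only label descriptions."""
    query = embed(text)
    path, node = [], root
    while taxonomy[node][1]:  # stop when the current node has no children
        node = max(
            taxonomy[node][1],
            key=lambda child: cosine(query, embed(taxonomy[child][0])),
        )
        path.append(node)
    return path


print(classify("road maintenance work"))
print(classify("purchase of laptops and servers"))
```

Because the decision at each level depends only on label descriptions, a class with no training observations can still be predicted as long as it has a description. In a realistic setting the `embed` function would be replaced by a pre-trained encoder, and the greedy descent could be swapped for a scoring scheme that is less sensitive to early mistakes.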