Classifying public tenders is useful both for companies that are invited to participate and for inspecting fraudulent activities. To facilitate the task for both participants and public administrations, the European Union introduced a common taxonomy (\textit{Common Procurement Vocabulary}, CPV), which is mandatory for tenders of certain importance; however, the contracts for which a CPV label is mandatory are a minority of all public administrations' activities. Classifying over a real-world taxonomy introduces difficulties that cannot be ignored. In particular, some fine-grained classes have few (if any) observations in the training set, while other classes are far more frequent (even thousands of times) than the average. To overcome these difficulties, we present a zero-shot approach based on a pre-trained language model that relies only on label descriptions and respects the label taxonomy. To train the proposed model, we used industrial data from \url{contrattipubblici.org}, a service by \href{https://spaziodati.eu}{SpazioDati s.r.l.} that collects public contracts stipulated in Italy over the last 25 years. Results show that the proposed model outperforms three different baselines in classifying low-frequency classes and is also able to predict never-seen classes.