Open-world classification is a natural language processing task of key practical relevance and impact. Because open, or {\em unknown}, category data appears only at inference time, it is challenging to find a model whose decision boundary both identifies the known classes and discriminates the open category. Existing models are limited either by the lack of effective open category data during training or by the lack of a good mechanism for learning appropriate decision boundaries. We propose an approach based on \underline{a}daptive \underline{n}egative \underline{s}amples (ANS), designed to generate effective synthetic open category samples during training without requiring any prior knowledge or external datasets. Empirically, we find a significant advantage in using auxiliary one-versus-rest binary classifiers, which make effective use of the generated negative samples and avoid the complex threshold-seeking stage of previous work. Extensive experiments on three benchmark datasets show that ANS achieves significant improvements over state-of-the-art methods.
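To illustrate how one-versus-rest binary classifiers can stand in for a tuned rejection threshold, the following is a minimal sketch, not the authors' implementation. It assumes each known class has its own binary classifier that outputs a probability, and that an input rejected by every classifier is assigned to the open category. The names `predict_open_world` and `OPEN_LABEL`, and the fixed 0.5 cutoff, are hypothetical choices made for this example.

```python
from typing import Sequence

OPEN_LABEL = -1  # hypothetical id for the open ("unknown") category


def predict_open_world(probs: Sequence[float], cutoff: float = 0.5) -> int:
    """Return the index of the accepted known class, or OPEN_LABEL.

    probs[k] is assumed to be the k-th one-versus-rest classifier's
    probability that the input belongs to known class k. The fixed
    0.5 cutoff is an assumption of this sketch, not a tuned value.
    """
    # Pick the known class whose binary classifier is most confident.
    best = max(range(len(probs)), key=lambda k: probs[k])
    # If even the most confident classifier rejects the input,
    # treat it as belonging to the open category.
    return best if probs[best] >= cutoff else OPEN_LABEL
```

With this rule, an input scored `[0.1, 0.9, 0.3]` is assigned to class 1, while an input scored `[0.2, 0.1, 0.4]` is rejected by all classifiers and labelled as open.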