Relation extraction (RE) aims to extract relations from sentences and documents. Existing RE models typically rely on supervised machine learning. However, recent studies have shown that many RE datasets are incompletely annotated. This is known as the false negative problem, in which valid relations are wrongly annotated as 'no_relation'. Models trained on such data inevitably make similar mistakes at inference time. Self-training has been shown to be effective in alleviating the false negative problem. However, traditional self-training is vulnerable to confirmation bias and performs poorly on minority classes. To overcome this limitation, we propose a novel class-adaptive re-sampling self-training framework. Specifically, we re-sample the pseudo-labels of each class according to that class's precision and recall scores. Our re-sampling strategy favors the pseudo-labels of classes with high precision and low recall, which improves overall recall without significantly compromising precision. We conduct experiments on document-level and biomedical relation extraction datasets, and the results show that our proposed self-training framework consistently outperforms existing competitive methods on the Re-DocRED and ChemDisgene datasets when the training data are incompletely annotated. Our code is released at https://github.com/DAMO-NLP-SG/CAST.
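The re-sampling idea described above can be illustrated with a minimal sketch. It is not taken from the released code: the function names, the `beta` exponent, and the specific keep-probability formula `(precision * (1 - recall)) ** beta` are assumptions chosen only because they grow with precision and shrink as recall rises, which matches the stated preference for high-precision, low-recall classes. Per-class precision and recall are assumed to have been measured beforehand, for example on a development set.

```python
import random


def sampling_probability(precision, recall, beta=1.0):
    """Hypothetical per-class keep probability.

    Assumed form: large when precision is high and recall is low.
    """
    return (precision * (1.0 - recall)) ** beta


def resample_pseudo_labels(pseudo_labels, class_scores, beta=1.0, seed=0):
    """Keep each pseudo-label with a probability that depends on its class.

    pseudo_labels: list of (example_id, relation_class) pairs.
    class_scores:  dict mapping relation_class -> (precision, recall).
    """
    rng = random.Random(seed)  # local generator for reproducible sampling
    kept = []
    for example_id, relation in pseudo_labels:
        precision, recall = class_scores[relation]
        if rng.random() < sampling_probability(precision, recall, beta):
            kept.append((example_id, relation))
    return kept
```

Under these assumptions, a class with precision 0.9 and recall 0.2 keeps its pseudo-labels far more often than a class with the same precision and recall 0.9, so the retained pseudo-labels are skewed toward classes where the model is reliable but still misses many true relations.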