Entity extraction is critical to the intelligent development of many domains and to the construction of knowledge agents. However, documents in some specific domains suffer from a category imbalance problem: some categories of entities are common, while others are rare and scattered. This paper proposes using Zipf's law to tackle this problem and improve the performance of entity extraction from documents. Using two forms of Zipf's law, words in the documents are classified as common or rare; sentences are then classified as common or rare, and the two groups are processed separately by text generation models. Rare entities in the generated sentences are labeled with human-designed rules and added to the raw dataset as a supplement to alleviate the category imbalance problem. A case study on extracting entities from technical documents on industrial safety is presented, and experimental results on two datasets show the effectiveness of the proposed method.
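The word and sentence classification step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which two forms of Zipf's law are used or how the boundary between common and rare words is derived, so the code below substitutes a plain frequency threshold as a stand-in. The function names `rank_frequency`, `split_words`, and `split_sentences`, the `min_count` parameter, and the rule that a sentence is rare if it contains at least one rare word are all assumptions made for illustration, not the paper's method.

```python
from collections import Counter


def rank_frequency(sentences):
    """Return (rank, word, count) tuples, rank 1 being the most frequent word.

    This is the table on which Zipf's rank-frequency relation
    (count roughly proportional to 1 / rank**s) is usually examined.
    """
    counts = Counter(word for sent in sentences for word in sent)
    # Sort by descending count, breaking ties alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]


def split_words(sentences, min_count=2):
    """Split the vocabulary into common and rare words.

    ASSUMPTION: a fixed count threshold stands in for the Zipf-derived
    boundary, which the abstract does not specify.
    """
    common, rare = set(), set()
    for _, word, count in rank_frequency(sentences):
        (common if count >= min_count else rare).add(word)
    return common, rare


def split_sentences(sentences, rare_words):
    """Split sentences into common and rare groups.

    ASSUMPTION: a sentence counts as rare if it contains any rare word.
    """
    common_sents, rare_sents = [], []
    for sent in sentences:
        target = rare_sents if any(w in rare_words for w in sent) else common_sents
        target.append(sent)
    return common_sents, rare_sents


if __name__ == "__main__":
    # Toy tokenized corpus, invented for illustration only.
    corpus = [
        ["valve", "pressure"],
        ["valve", "leak"],
        ["pressure", "valve"],
    ]
    common_words, rare_words = split_words(corpus, min_count=2)
    common_sents, rare_sents = split_sentences(corpus, rare_words)
    print("rare words:", sorted(rare_words))
    print("rare sentences:", rare_sents)
```

In a pipeline like the one the abstract outlines, the two sentence groups would then be passed to separate text generation models; that stage is omitted here because the abstract gives no detail on the models or the labeling rules.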