Entity extraction is critical to intelligent applications across diverse domains, but its effectiveness is limited by data imbalance. This paper approaches the problem from a quantitative perspective: some entities are common while others are scarce, and this disparity is reflected in the quantifiable distribution of words. Zipf's law is well suited to describing that distribution. To move from words to entities, words in the documents are first classified as common or rare. Sentences are then classified as common or rare in turn and processed by text generation models accordingly. Rare entities in the generated sentences are labeled using human-designed rules, and the labeled sentences supplement the raw dataset, thereby mitigating the imbalance. The study presents a case of extracting entities from technical documents, and experimental results on two datasets demonstrate the effectiveness of the proposed method. The significance of Zipf's law in driving the progress of AI is also discussed, broadening the reach and coverage of Informetrics. The paper thus demonstrates how Informetrics can be extended to interface with AI through Zipf's law.
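The word-to-sentence step can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not state how the common/rare boundary is drawn, so the frequency threshold `min_count`, the sentence-level ratio `rare_ratio`, the simple tokenizer, and the function names are all assumptions made here for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_frequency(sentences):
    """Return (word, count) pairs sorted by descending count, i.e. by Zipf rank."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    return counts.most_common()


def split_common_rare(sentences, min_count=2):
    """Split the vocabulary into common and rare words.

    Assumption: a word is 'common' if it occurs at least `min_count` times.
    The paper may draw this boundary differently.
    """
    common, rare = set(), set()
    for word, count in rank_frequency(sentences):
        (common if count >= min_count else rare).add(word)
    return common, rare


def classify_sentences(sentences, rare_words, rare_ratio=0.5):
    """Label a sentence 'rare' when its share of rare tokens reaches `rare_ratio`.

    Assumption: the ratio rule is hypothetical; the abstract only says that
    sentences are classified into common and rare ones.
    """
    labels = []
    for s in sentences:
        toks = tokenize(s)
        n_rare = sum(t in rare_words for t in toks)
        is_rare = bool(toks) and n_rare / len(toks) >= rare_ratio
        labels.append("rare" if is_rare else "common")
    return labels


if __name__ == "__main__":
    docs = [
        "The pump drives the valve",
        "The valve feeds the pump",
        "A cavitation fault damages impeller",
    ]
    common_words, rare_words = split_common_rare(docs)
    for sentence, label in zip(docs, classify_sentences(docs, rare_words)):
        print(label, "|", sentence)
```

In this sketch, sentences labeled rare would be the ones routed to the generation and rule-based labeling steps, so that the augmented data adds examples from the tail of the distribution.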