Information Extraction in Domain and Generic Documents: Findings from Heuristic-based and Data-driven Approaches

Information extraction (IE) plays a very important role in natural language processing (NLP) and is fundamental to many NLP applications that extract structured information from unstructured text data. Heuristic-based searching and data-driven learning are the two mainstream implementation approaches. However, little attention has been paid to the influence of document genre and length on IE tasks. To fill this gap, we investigated the accuracy and generalization abilities of heuristic-based searching and data-driven learning on two IE tasks, named entity recognition (NER) and semantic role labeling (SRL), using domain-specific and generic documents of different lengths. We posited two hypotheses: first, short documents may yield better accuracy than long documents; second, generic documents may exhibit superior extraction outcomes relative to domain-dependent documents because of limitations in the genres of the training documents. Our findings reveal that no single method was clearly superior on both tasks. For named entity extraction, data-driven approaches outperformed symbolic methods in terms of accuracy, particularly on short texts. For semantic role extraction, the heuristic-based searching method and the data-driven model with syntax representation outperformed the pure data-driven approach, which considers only semantic information. Additionally, we found that different semantic roles exhibited varying accuracy levels under the same method. This study offers valuable insights for downstream text mining tasks that rely on NER and SRL when addressing various document features and genres.
