The emergence of Large Language Models (LLMs) has improved performance and opened new possibilities across NLP tasks. While generative AI models like ChatGPT open up new opportunities for business use cases, their current tendency to hallucinate content strongly limits their applicability to document analysis tasks such as information retrieval from documents. In contrast, extractive language models, such as question answering (QA) or passage retrieval models, guarantee that query results lie within the boundaries of the corresponding context document, which makes them candidates for more reliable information extraction in companies' production environments. In this work, we propose an approach that integrates extractive QA models into a document analysis solution to improve feature extraction from German business documents such as insurance reports or medical leaflets. We further show that fine-tuning existing German QA models boosts performance on tailored extraction tasks for complex linguistic features, like damage cause explanations or descriptions of medication appearance, even when using only a small set of annotated data. Finally, we discuss the relevance of scoring metrics for evaluating information extraction tasks and derive a combined metric from Levenshtein distance, F1 score, Exact Match, and ROUGE-L to mimic the assessment criteria of human experts.
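To make the idea of a combined metric concrete, the following is a minimal sketch of how the four named scores could be computed for a predicted and a reference answer span and then merged into one value. The abstract does not state how the metrics are combined, so the equal weights, the whitespace tokenization, the use of the ROUGE-L F-measure, and all function names (`combined_score` and its helpers) are assumptions made for illustration only.

```python
from collections import Counter


def _tokens(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization; a real setup may normalize further.
    return text.lower().split()


def exact_match(pred: str, gold: str) -> float:
    # 1.0 if the normalized token sequences are identical, else 0.0.
    return float(_tokens(pred) == _tokens(gold))


def token_f1(pred: str, gold: str) -> float:
    # Harmonic mean of token-level precision and recall (bag-of-words overlap).
    p, g = _tokens(pred), _tokens(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def levenshtein_similarity(pred: str, gold: str) -> float:
    # Character-level edit distance, rescaled to [0, 1] where 1 means identical.
    a, b = pred.lower(), gold.lower()
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def rouge_l(pred: str, gold: str) -> float:
    # F-measure over the longest common subsequence (LCS) of tokens.
    p, g = _tokens(pred), _tokens(gold)
    if not p or not g:
        return float(p == g)
    prev = [0] * (len(g) + 1)
    for tp in p:
        curr = [0]
        for j, tg in enumerate(g, start=1):
            curr.append(prev[j - 1] + 1 if tp == tg else max(prev[j], curr[j - 1]))
        prev = curr
    lcs = prev[-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(p), lcs / len(g)
    return 2 * precision * recall / (precision + recall)


def combined_score(pred: str, gold: str, weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    # Assumption: a weighted average with equal weights; the weighting is hypothetical.
    metrics = (exact_match, token_f1, levenshtein_similarity, rouge_l)
    return sum(w * m(pred, gold) for w, m in zip(weights, metrics))
```

Calling `combined_score("Wasserschaden durch Rohrbruch", "Rohrbruch")` yields a value strictly between 0 and 1: Exact Match fails, while the overlap-based scores give partial credit. This is the kind of graded judgment a combined metric is meant to capture; the weights would need to be tuned against expert ratings.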