An Entity-based Claim Extraction Pipeline for Real-world Biomedical Fact-checking

Existing fact-checking models for biomedical claims are typically trained on synthetic or well-worded data and transfer poorly to social media content. This mismatch can be mitigated by adapting the social media input to mimic the focused nature of common training claims. To do so, Wuehrl & Klinger (2022) propose to extract concise claims based on medical entities in the text. However, their study has two limitations. First, it relies on gold-annotated entities, so its feasibility for a real-world application cannot be assessed, since such an application requires detecting relevant entities automatically. Second, it represents claim entities with the original tokens from the tweet. This constitutes a terminology mismatch which potentially limits the fact-checking performance. To understand both challenges, we propose a claim extraction pipeline for medical tweets that incorporates named entity recognition (NER) and terminology normalization via entity linking. We show that automatic NER does lead to a performance drop in comparison to using gold annotations, but the fact-checking performance still improves considerably over inputting the unchanged tweets. Normalizing entities to their canonical forms, however, does not improve the performance.
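As a rough illustration of the pipeline's three stages (entity detection, normalization, claim extraction), the following minimal sketch may help. It is not the authors' implementation: the lexicon, the function names, and the rule that a claim is the text span from the first to the last detected entity are all assumptions made for this example, and a dictionary lookup stands in for a trained NER model and entity linker.

```python
import re
from dataclasses import dataclass

# Hypothetical toy lexicon: surface form -> (entity type, canonical term).
# A real pipeline would use a trained NER model and an entity linker instead.
LEXICON = {
    "aspirin": ("Treatment", "acetylsalicylic acid"),
    "heart attack": ("MedicalCondition", "myocardial infarction"),
    "covid": ("MedicalCondition", "COVID-19"),
}


@dataclass
class Entity:
    start: int       # character offset where the mention begins
    end: int         # character offset just past the mention
    text: str        # mention as written in the tweet
    label: str       # entity type
    canonical: str   # normalized term (stand-in for entity linking)


def detect_entities(tweet):
    """Dictionary lookup standing in for automatic NER plus entity linking."""
    found = []
    for surface, (label, canonical) in LEXICON.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for m in re.finditer(pattern, tweet, flags=re.IGNORECASE):
            found.append(Entity(m.start(), m.end(), m.group(0), label, canonical))
    found.sort(key=lambda e: e.start)
    # Drop mentions that overlap an earlier one so spans stay disjoint.
    kept, last_end = [], 0
    for ent in found:
        if ent.start >= last_end:
            kept.append(ent)
            last_end = ent.end
    return kept


def extract_claim(tweet, normalize=False):
    """Assumed rule: the claim is the span from the first to the last entity.

    Falls back to the unchanged tweet when fewer than two entities are found.
    """
    entities = detect_entities(tweet)
    if len(entities) < 2:
        return tweet
    first, last = entities[0], entities[-1]
    if not normalize:
        return tweet[first.start:last.end]
    # Rebuild the span, swapping each mention for its canonical form.
    parts, cursor = [], first.start
    for ent in entities:
        parts.append(tweet[cursor:ent.start])
        parts.append(ent.canonical)
        cursor = ent.end
    return "".join(parts)


tweet = "my uncle says aspirin stops a heart attack, is that true??"
print(extract_claim(tweet))
print(extract_claim(tweet, normalize=True))
```

The `normalize` flag mirrors the two input variants compared in the abstract: claims built from the original tokens versus claims whose entities are replaced by canonical terms. The extracted string would then be passed to a fact-checking model in place of the full tweet.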

