This study introduces K-Tokeniser, a novel knowledge-enhanced tokenisation mechanism for clinical text processing. Technically, at the initialisation stage, K-Tokeniser populates global representations of tokens based on the semantic types of domain concepts (such as drugs or diseases), drawn either from a domain ontology such as the Unified Medical Language System or from the training data of the task-related corpus. At the training or inference stage, sentence-level localised context is utilised to choose the optimal global token representation, realising semantic-based tokenisation. To avoid pre-training with the new tokeniser, an embedding initialisation approach is proposed to generate representations for the new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets to evaluate K-Tokeniser across a wide range of clinical text analytics tasks, including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task, with a 13\% increase in Micro $F_1$ score. Furthermore, K-Tokeniser also shows a significant capacity to accelerate the convergence of language models. Specifically, with K-Tokeniser, the language models require only 50\% of the training data to achieve the best performance of the baseline tokeniser trained on all data in the concept extraction task, and less than 20\% of the data in the automated coding task. Notably, all these improvements require no pre-training process, making the approach generalisable.
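The abstract does not detail the embedding initialisation approach, so the sketch below illustrates one common way such an initialisation can work, assuming each new token's embedding is seeded with the mean of the embeddings of the sub-tokens the baseline tokeniser would split it into. The function name, checkpoint, and example tokens are illustrative, not taken from the paper.

```python
# A minimal, hypothetical sketch of initialising embeddings for new tokens
# without any further pre-training. Assumption (not specified in the abstract):
# a new token's vector is the mean of its baseline sub-token vectors.
import torch
from transformers import AutoModel, AutoTokenizer

def init_new_token_embeddings(model, tokenizer, new_tokens):
    """Add new tokens and initialise their embeddings from sub-token averages."""
    # Record how the baseline tokeniser splits each new token *before* adding it,
    # since afterwards the tokeniser would return the new token's own id.
    subword_ids = [tokenizer(t, add_special_tokens=False)["input_ids"]
                   for t in new_tokens]
    tokenizer.add_tokens(new_tokens)
    model.resize_token_embeddings(len(tokenizer))
    emb = model.get_input_embeddings().weight
    with torch.no_grad():
        for token, ids in zip(new_tokens, subword_ids):
            new_id = tokenizer.convert_tokens_to_ids(token)
            # The mean of the constituent sub-token vectors approximates the
            # new token's representation, avoiding a pre-training pass.
            emb[new_id] = emb[ids].mean(dim=0)

# Illustrative usage with a generic checkpoint and made-up clinical tokens.
model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
init_new_token_embeddings(model, tokenizer, ["nephropathy", "hyperlipidaemia"])
```

A usage note on the design choice: seeding from sub-token averages keeps the new vectors inside the distribution the model already understands, which is consistent with the abstract's claim that the improvements require no pre-training process.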