Comparing Variation in Tokenizer Outputs Using a Series of Problematic and Challenging Biomedical Sentences

Background & Objective: Biomedical text data are increasingly available for research. Tokenization is an initial step in many biomedical text mining pipelines. It is the process of parsing an input biomedical sentence (represented as a digital character sequence) into a discrete set of word/token symbols, each of which conveys focused semantic/syntactic meaning. The objective of this study is to explore variation in tokenizer outputs across a series of challenging biomedical sentences. Method: Diaz [2015] introduces 24 challenging example biomedical sentences for comparing tokenizer performance. In this study, we descriptively explore variation in the outputs of eight tokenizers applied to each example sentence. The tokenizers compared are the NLTK whitespace tokenizer, the NLTK Penn Treebank tokenizer, the spaCy and scispaCy tokenizers, the Stanza and Stanza-CRAFT tokenizers, the UDPipe tokenizer, and R-tokenizers. Results: For many examples, the tokenizers performed similarly; however, for certain examples, there was meaningful variation in the returned outputs. The whitespace tokenizer often performed differently from the other tokenizers. We observed performance similarities between tokenizers implementing rule-based systems (e.g. pattern matching and regular expressions) and tokenizers implementing neural architectures for token classification. Often, the challenging tokens resulting in the greatest variation in outputs are words that convey substantive and focused biomedical/clinical meaning (e.g. x-ray, IL-10, TCR/CD3, CD4+ CD8+, and (Ca2+)-regulated). Conclusion: When state-of-the-art, open-source tokenizers from Python and R were applied to a series of challenging biomedical example sentences, we observed subtle variation in the returned outputs.
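As a rough illustration of why tokens such as IL-10 or (Ca2+)-regulated can produce divergent outputs, the following minimal sketch contrasts plain whitespace splitting with a toy regular-expression rule, using only the Python standard library. The example sentence, the pattern, and the function names are assumptions made for illustration: the sentence is not one of the 24 sentences from Diaz [2015], and the regular expression is not the rule set of any tokenizer compared in the study.

```python
import re

# Hypothetical sentence assembled from the challenging tokens named in the
# abstract; it is NOT one of the 24 example sentences from Diaz [2015].
SENTENCE = "TCR/CD3 and IL-10 act on CD4+ CD8+ cells via a (Ca2+)-regulated pathway."

# Toy rule (an assumption for illustration): a token is either a run of word
# characters or a single character that is neither a word character nor
# whitespace. Real rule-based tokenizers use far richer rules and exceptions.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached."""
    return text.split()


def regex_tokenize(text):
    """Apply the toy pattern, separating every punctuation character."""
    return TOKEN_PATTERN.findall(text)


def differing_spans(text):
    """Return whitespace-delimited chunks that the toy rule splits further."""
    return [chunk for chunk in whitespace_tokenize(text)
            if len(regex_tokenize(chunk)) > 1]


if __name__ == "__main__":
    print("whitespace:", whitespace_tokenize(SENTENCE))
    print("toy regex: ", regex_tokenize(SENTENCE))
    print("disagree on:", differing_spans(SENTENCE))
```

In this sketch the two strategies agree on ordinary words and disagree on the biomedical terms: whitespace splitting keeps "IL-10" whole, while the toy rule breaks it into "IL", "-", and "10". Neither output is presented here as the correct one; the point is only that small differences in splitting rules concentrate on tokens of this kind.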
