Research on the advantages and trade-offs of using characters, rather than tokenized text, as input to deep learning models has evolved substantially. New token-free models remove the traditional tokenization step; however, their efficiency remains unclear. Moreover, the effect of tokenization on sequence tagging tasks is relatively unexplored. To this end, we investigate the impact of tokenization when extracting information from documents and present a comparative study of subword-based and character-based models. Specifically, we study Information Extraction (IE) from biomedical texts. The main outcome is twofold: tokenization patterns can introduce an inductive bias that results in state-of-the-art performance, and character-based models produce promising results; thus, transitioning to token-free IE models is feasible.