Current state-of-the-art models for natural language understanding require a preprocessing step that converts raw text into discrete tokens. This process, known as tokenization, relies on a pre-built vocabulary of words or sub-word morphemes. Such a fixed vocabulary limits the model's robustness to spelling errors and its capacity to adapt to new domains. In this work, we introduce a novel open-vocabulary language model that adopts a hierarchical two-level approach, with one level operating on words and the other on the sequence. Concretely, we design an intra-word module that uses a shallow Transformer architecture to learn word representations from their characters, and a deep inter-word Transformer module that contextualizes each word representation by attending to the entire word sequence. Our model thus operates directly on character sequences with explicit awareness of word boundaries, but without a biased sub-word or word-level vocabulary. Experiments on various downstream tasks show that our method outperforms strong baselines. We also demonstrate that our hierarchical model is robust to textual corruption and domain shift.
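The two-level design described above can be illustrated with a toy sketch in plain Python. It is not the authors' implementation: the whitespace word splitting, the fixed sinusoidal character and position embeddings, the parameter-free single-head attention (no learned projections, feed-forward layers, or stacking), the mean pooling over characters, and all names (`DIM`, `embed_char`, `self_attention`, `encode_word`, `encode_text`) are assumptions chosen only to show how data flows from characters to word vectors to contextualized word vectors.

```python
import math

DIM = 8  # toy embedding width (assumption, not from the paper)


def embed_char(ch, pos):
    """Hypothetical fixed embedding of a character at a position in its word.

    A real model would learn these embeddings; no vocabulary lookup is needed
    here because any Unicode code point maps to a vector.
    """
    cp = ord(ch)
    return [math.sin(cp * (i + 1) * 0.1) + math.cos(pos * (i + 1) * 0.5)
            for i in range(DIM)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned Q/K/V projections.

    Stands in for a Transformer layer; each output is a weighted average of
    all input vectors.
    """
    scale = math.sqrt(len(vectors[0]))
    out = []
    for q in vectors:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale
                           for k in vectors])
        out.append([sum(w * v[i] for w, v in zip(weights, vectors))
                    for i in range(len(q))])
    return out


def encode_word(word):
    """Intra-word level: attend over one word's characters, then mean-pool."""
    attended = self_attention([embed_char(ch, pos)
                               for pos, ch in enumerate(word)])
    return [sum(col) / len(attended) for col in zip(*attended)]


def encode_text(text):
    """Inter-word level: contextualize each word vector over the sequence."""
    words = text.split()  # whitespace word boundaries (assumption)
    if not words:
        return []
    return self_attention([encode_word(w) for w in words])
```

Calling `encode_text("the modle reads charcters")` returns one vector per word, including the misspelled ones, since every word is built from its characters and never looked up in a vocabulary.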