In this work, we show a fundamental limitation in vocabulary adaptation approaches that use the Byte-Pair Encoding (BPE) tokenization scheme for fine-tuning pretrained language models (PLMs) to expert domains. Current approaches trivially append the target domain-specific vocabulary to the end of the PLM vocabulary. Because BPE tokenizes text by iteratively applying merge rules in priority order, these appended tokens receive lower priority scores, which leads to sub-optimal tokenization. To mitigate this issue, we propose AdaptBPE, which modifies the BPE tokenization initialization phase to first perform longest string matching on the added (target) vocabulary before falling back to character-level tokenization. We perform an extensive evaluation of AdaptBPE against standard BPE over various classification and summarization tasks; AdaptBPE improves performance by 3.57% (in terms of accuracy) and 1.87% (in terms of Rouge-L), respectively. AdaptBPE for MEDVOC works particularly well when reference summaries have a high OOV concentration or are longer. A human evaluation further reveals that AdaptBPE generates more relevant and more faithful summaries than MEDVOC. We make our codebase publicly available at https://github.com/gb-kgp/adaptbpe.
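The modified initialization phase can be illustrated with a minimal sketch: instead of splitting a word directly into characters, AdaptBPE first greedily matches the longest substring present in the added domain vocabulary. The function name and the toy vocabulary below are illustrative assumptions, not the actual implementation; the real added vocabulary comes from the adapted PLM tokenizer.

```python
def adaptbpe_init(word, added_vocab):
    """Sketch of AdaptBPE's initialization: greedily take the longest
    substring found in the added (domain) vocabulary; otherwise fall
    back to single characters, as in standard BPE initialization."""
    units = []
    i = 0
    while i < len(word):
        match = None
        # Try the longest possible match first, shrinking from the right.
        for j in range(len(word), i, -1):
            if word[i:j] in added_vocab:
                match = word[i:j]
                break
        if match:
            units.append(match)
            i += len(match)
        else:
            units.append(word[i])  # character-level fallback
            i += 1
    return units

# Toy example: "hyperten" is an added domain token, so it is kept whole
# while the remainder is split into characters for later BPE merges.
print(adaptbpe_init("hypertension", {"hyperten", "tension"}))
# → ['hyperten', 's', 'i', 'o', 'n']
```

Standard BPE merges would then proceed from these units; because the domain token survives initialization intact, it no longer depends on low-priority appended merge rules to be reconstructed.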