Patent Language Model Pretraining with ModernBERT

Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have relied primarily on fine-tuning general-purpose models or on domain-adapted variants pretrained with limited data. In this work, we pretrain three domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, outperforms the general-purpose ModernBERT baseline on three of the four datasets and performs competitively with a PatentBERT baseline. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large show that scaling the model size and customizing the tokenizer further improve performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference, more than 3x that of PatentBERT, which makes them well suited to time-sensitive applications. These results highlight the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.
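To make the masked language modeling objective concrete, the following is a minimal sketch of the token-corruption step that such pretraining relies on. It is illustrative only and is not the authors' pipeline: the function name `mask_tokens`, the `[MASK]` string, the whitespace tokenization, and the masking probability of 0.3 are all assumptions chosen for the example, since the abstract does not specify them.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder; real tokenizers define their own mask token


def mask_tokens(tokens, mask_prob=0.3, seed=0):
    """Return (corrupted, labels) for one masked-language-modeling example.

    labels[i] holds the original token where position i was masked and
    None elsewhere, so a training loss would only cover hidden positions.
    mask_prob=0.3 is an illustrative assumption, not a value from the source.
    """
    rng = random.Random(seed)  # local RNG keeps the example reproducible
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)            # remember what it should predict
        else:
            corrupted.append(tok)         # leave the token visible
            labels.append(None)           # no prediction target here
    return corrupted, labels


# Hypothetical patent-style sentence, split on whitespace for simplicity.
claim = "a fastener comprising a threaded shaft and a hexagonal head".split()
corrupted, labels = mask_tokens(claim)
print(corrupted)
print(labels)
```

During pretraining the model would receive `corrupted` as input and be trained to recover the non-`None` entries of `labels`. A real setup would operate on subword IDs from a trained tokenizer, not on whitespace-split words.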

