Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained on limited data. In this work, we pretrain three domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, consistently outperforms the general-purpose ModernBERT baseline on three of the four datasets and achieves performance competitive with the PatentBERT baseline. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large show that scaling model size and customizing the tokenizer further improve performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference (over 3x faster than PatentBERT), making them well suited to time-sensitive applications. These results underscore the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.
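As context for the GLU feed-forward layers mentioned among the architectural optimizations, the sketch below shows a minimal GeGLU-style gated feed-forward block of the kind used in ModernBERT-style encoders. The class name `GLUFeedForward` and the hidden dimensions are illustrative assumptions for exposition, not the exact configuration used in our models.

```python
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    """Minimal GeGLU-style feed-forward block (illustrative sketch).

    A single input projection produces both a gate branch and a value
    branch; the gate is passed through GELU and multiplied elementwise
    with the value before the output projection.
    """

    def __init__(self, d_model: int = 768, d_hidden: int = 2048, dropout: float = 0.0):
        super().__init__()
        # One projection yields both gate and value halves (2 * d_hidden).
        self.wi = nn.Linear(d_model, 2 * d_hidden, bias=False)
        self.wo = nn.Linear(d_hidden, d_model, bias=False)
        self.act = nn.GELU()
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.wi(x).chunk(2, dim=-1)
        return self.wo(self.drop(self.act(gate) * value))

# Example: a batch of 2 sequences, 128 tokens, hidden size 768.
x = torch.randn(2, 128, 768)
print(GLUFeedForward()(x).shape)  # torch.Size([2, 128, 768])
```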