Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding large improvements in masked language modeling performance over non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, with intrinsic language modeling performance improving at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including natural language inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.