This paper presents the Mecellem models, a framework for developing specialized language models for the Turkish legal domain through domain adaptation strategies. We make two contributions: (1) Encoder Model Pre-trained from Scratch: ModernBERT-based bidirectional encoders pre-trained on a Turkish-dominant corpus of 112.7 billion tokens. We implement a checkpoint selection strategy that evaluates downstream retrieval performance throughout training, revealing that the retrieval-optimal checkpoints occur before the pre-training loss reaches its minimum. Our encoder models achieve top-3 rankings on the Turkish retrieval leaderboard, with smaller models (155M parameters) matching larger reference models (307M-567M parameters). Our approach reaches 92.36% production efficiency relative to state-of-the-art models (embeddinggemma-300m: 100.00%, BAAI/bge-m3: 99.54%, newmindai/bge-m3-stsb: 94.38%), ranking fourth overall while requiring fewer computational resources. These state-of-the-art models rely on multi-stage, computationally intensive training pipelines, making our pipeline of single-stage pre-training followed by efficient post-training a cost-effective alternative; (2) Decoder Model with Continual Pre-training (CPT): Qwen3-1.7B and Qwen3-4B models adapted to the Turkish legal domain through controlled curriculum learning. A four-phase CPT curriculum with optimal sample ratios enables a gradual transition from general language knowledge to specialized legal terminology and long-context reasoning, achieving a 36.2% perplexity reduction on Turkish legal text and demonstrating the gains from domain adaptation.
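To make the checkpoint selection strategy concrete, the sketch below illustrates the idea: every saved checkpoint is scored on a downstream retrieval dev set, and the checkpoint with the best retrieval score is kept rather than the one with the lowest pre-training loss. The steps, losses, and nDCG@10 values here are invented placeholders for illustration, not numbers from our experiments.

```python
# Minimal sketch of retrieval-driven checkpoint selection.
# (step, pre-training loss, retrieval nDCG@10 on a held-out dev set)
checkpoints = [
    (10_000, 2.41, 0.512),
    (20_000, 2.18, 0.547),
    (30_000, 2.05, 0.561),  # retrieval peaks here...
    (40_000, 1.97, 0.553),
    (50_000, 1.94, 0.549),  # ...while the pre-training loss keeps falling.
]

best_by_retrieval = max(checkpoints, key=lambda c: c[2])
best_by_loss = min(checkpoints, key=lambda c: c[1])

print(f"selected by retrieval nDCG@10: step {best_by_retrieval[0]}")  # step 30000
print(f"selected by minimum loss:      step {best_by_loss[0]}")       # step 50000
```

The two criteria disagree: the retrieval-optimal checkpoint precedes the loss minimum, which is exactly the behavior the selection strategy exploits.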
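For intuition on the 36.2% perplexity figure, a quick back-of-envelope check, assuming the standard relationship perplexity = exp(cross-entropy loss); the base perplexity below is hypothetical, not a reported result:

```python
import math

reduction = 0.362                              # figure reported in the abstract
loss_drop = math.log(1.0 / (1.0 - reduction))  # nats of loss implied by the drop
print(f"{reduction:.1%} lower perplexity = {loss_drop:.3f} nats lower loss/token")

ppl_before = 12.0                              # hypothetical base-model perplexity
ppl_after = ppl_before * (1.0 - reduction)
print(f"e.g. perplexity {ppl_before:.1f} -> {ppl_after:.2f}")
```

Because perplexity is exponential in the loss, a 36.2% reduction corresponds to roughly 0.45 nats lower cross-entropy per token, regardless of the starting perplexity.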