Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder-only model for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores on two of the four tasks but showing no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more carefully curated models such as BERTurk show that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both an openly released resource for Turkish NLP and an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Hugging Face formats.