We create two German-only decoder models, LL\"aMmlein 120M and 1B, transparently from scratch and publish them, along with the training data, for the German NLP research community to use. Building the models involved several key steps: extensive data preprocessing, the creation of a custom German tokenizer, the training itself, and the evaluation of the final models on various benchmarks. Throughout the training process, multiple checkpoints were saved and analyzed using the SuperGLEBer benchmark to monitor the models' learning dynamics. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LL\"aMmlein models performed competitively, consistently matching or surpassing models of similar parameter size. The results show that model quality scales with size as expected, but that performance improvements on some tasks plateaued early, offering valuable insights into resource allocation for future model development.
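To make the tokenizer step concrete, the following is a minimal sketch of training a byte-level BPE tokenizer on German text with the Hugging Face tokenizers library; the corpus path, vocabulary size, and special tokens are illustrative placeholders, not the configuration actually used for LL\"aMmlein.

\begin{verbatim}
# Minimal sketch: training a custom German byte-level BPE tokenizer
# with the Hugging Face `tokenizers` library. All concrete values
# (corpus file, vocab size, special tokens) are assumptions for
# illustration, not the paper's actual settings.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,                        # assumed vocabulary size
    special_tokens=["<s>", "</s>", "<unk>"],  # assumed special tokens
)

# `german_corpus.txt` is a placeholder for the preprocessed German data.
tokenizer.train(files=["german_corpus.txt"], trainer=trainer)
tokenizer.save("german_tokenizer.json")
\end{verbatim}

A tokenizer trained this way on the preprocessed corpus can then be reused unchanged across all checkpoints, so that benchmark scores over the course of training reflect the model rather than changes in segmentation.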