Encoders remain essential for efficient German NLP and NLU scenarios despite the rise of decoder-only LLMs. This work studies two routes to high-quality German encoders under identical data and training constraints: 1) training from scratch and 2) converting decoders via LLM2Vec. We introduce two resources: ModernGBERT (134M, 1B), fully transparent German encoders in the ModernBERT style, and LLäMmleinVec (120M, 1B, 7B), decoder-to-encoder conversions trained with masked next-token prediction, both undergoing context extension to 8,192 tokens. On SuperGLEBer, ModernGBERT 1B sets a new state of the art (avg. 0.808), surpassing GBERT Large (+4%) and the seven-times-larger converted 7B model (0.787). On German MTEB after supervised fine-tuning, ModernGBERT 1B (0.551) approaches the converted 7B model (0.557). We release all models, checkpoints, datasets, and full training records, and introduce an encoder-adapted QA-NIAH evaluation. Overall, our results provide actionable guidance: when parameter efficiency and latency matter, from-scratch encoders dominate; when a pre-trained decoder is available and compute is limited, conversion offers an effective alternative. ModernGBERT and LLäMmleinVec, including all code, data, and intermediate checkpoints, are published under a research-only RAIL license.