Open-weight large language models (LLMs) have been released by frontier labs; however, sovereign LLMs (for languages other than English) remain in short supply despite high demand. Training LLMs for low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce Dicta-LM 3.0: an open-weight collection of LLMs trained on large corpora of Hebrew and English text. The collection is released in three sizes: 24B, adapted from the Mistral-Small-3.1 base model; 12B, adapted from the NVIDIA Nemotron Nano V2 model; and 1.7B, adapted from the Qwen3-1.7B base model. For each size we release multiple variants, each with a native context length of 65k tokens: a base model and a chat model with tool-calling support. To rigorously evaluate our models, we introduce a new benchmark suite for Hebrew chat-LLMs, covering a diverse set of tasks including Translation, Summarization, Winograd, Israeli Trivia, and Diacritization (nikud). Our work not only addresses the intricacies of training LLMs for low-resource languages but also provides a framework that can be leveraged to adapt other LLMs to non-English languages, contributing to the broader field of multilingual NLP.
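To illustrate how scores from such a multi-task suite might be aggregated, here is a minimal sketch. The task names follow the suite described above, but the scoring scheme (exact match) and the `aggregate_scores` helper are illustrative assumptions, not the paper's actual evaluation protocol.

```python
# Minimal sketch: per-task score aggregation for a multi-task chat-LLM
# benchmark. Exact-match scoring is an illustrative assumption; real
# tasks like Translation or Summarization would use task-specific metrics.

from collections import defaultdict

def aggregate_scores(results):
    """results: iterable of (task, prediction, reference) triples.
    Returns a dict mapping each task to its exact-match accuracy."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for task, pred, ref in results:
        totals[task] += 1
        if pred.strip() == ref.strip():
            hits[task] += 1
    return {task: hits[task] / totals[task] for task in totals}

# Toy usage with placeholder predictions:
results = [
    ("Winograd", "a", "a"),
    ("Winograd", "b", "a"),
    ("Israeli Trivia", "ירושלים", "ירושלים"),
]
print(aggregate_scores(results))
```

In practice, each task would plug in its own metric (e.g. BLEU or chrF for Translation, character-level accuracy for Diacritization) behind the same aggregation interface.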