We present DictaBERT, a new state-of-the-art pre-trained BERT model for modern Hebrew that outperforms existing models on most benchmarks. Additionally, we release two fine-tuned versions of the model, each designed for a foundational task in the analysis of Hebrew texts: prefix segmentation and morphological tagging. These fine-tuned models allow any developer to perform prefix segmentation and morphological tagging of a Hebrew sentence with a single call to a HuggingFace model, without the need to integrate any additional libraries or code. In this paper we describe the details of the training as well as the results on the different benchmarks. We release the models to the community, along with sample code demonstrating their use, as part of our goal to help further research and development in Hebrew NLP.