We present DictaBERT, a new state-of-the-art pre-trained BERT model for modern Hebrew that outperforms existing models on most benchmarks. We also release three fine-tuned versions of the model, each designed for a foundational task in the analysis of Hebrew texts: prefix segmentation, morphological tagging, and question answering. With these fine-tuned models, any developer can perform these tasks on a Hebrew input with a single call to a HuggingFace model, without integrating any additional libraries or code. In this paper we describe the details of the training as well as the results on the different benchmarks. We release the models to the community, along with sample code demonstrating their use, as part of our goal to help further research and development in Hebrew NLP.
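To illustrate what "a single call to a HuggingFace model" might look like in practice, here is a minimal sketch rather than the authors' released sample code. It assumes the prefix-segmentation checkpoint is published under the Hub identifier `dicta-il/dictabert-seg`, and that the checkpoint ships custom code exposing a `predict(sentences, tokenizer)` method; both are assumptions not stated in the abstract. The helper names `load_model` and `analyze` are hypothetical.

```python
from typing import Any, List


def load_model(model_id: str = "dicta-il/dictabert-seg"):
    """Load a tokenizer and model from the HuggingFace Hub (needs network access).

    ASSUMPTION: the default model_id is the Hub name of the segmentation
    checkpoint; substitute the identifier given in the released sample code.
    """
    # Imported lazily so that `analyze` can be used and tested without
    # having `transformers` installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # ASSUMPTION: trust_remote_code=True is required because the task-specific
    # `predict` method lives in custom code shipped with the checkpoint.
    model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model


def analyze(sentences: List[str], tokenizer: Any, model: Any) -> Any:
    """Run a fine-tuned model on a batch of Hebrew sentences in one call."""
    # ASSUMPTION: the checkpoint exposes `predict(sentences, tokenizer)`.
    return model.predict(sentences, tokenizer)
```

Under these assumptions, usage would be `tokenizer, model = load_model()` followed by `analyze([sentence], tokenizer, model)` for a Hebrew string `sentence`. Keeping the loading step separate from the prediction step means the model is downloaded once and reused across calls.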