The development of domain-specific language models has significantly advanced natural language processing in specialized fields, particularly biomedicine. However, work to date has focused largely on English, leaving a gap for less-resourced languages such as Italian. This paper introduces Igea, the first decoder-only language model designed explicitly for biomedical text generation in Italian. Built on the Minerva model and continually pretrained on a diverse corpus of Italian medical texts, Igea is available in three sizes: 350 million, 1 billion, and 3 billion parameters. The models aim to balance computational efficiency and performance while addressing the challenges posed by the peculiarities of Italian medical terminology. We evaluate Igea on a mix of in-domain biomedical corpora and general-purpose benchmarks, highlighting its efficacy and its retention of general knowledge even after domain-specific training. This paper discusses the model's development and evaluation, providing a foundation for future advancements in Italian biomedical NLP.