Addressing the challenge of limited annotated data in specialized fields and low-resource languages is crucial for the effective use of Language Models (LMs). While most Large Language Models (LLMs) are trained on general-purpose English corpora, there is a notable gap in models specifically tailored for Italian, particularly for technical and bureaucratic jargon. This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in these specialized contexts. Our study concentrates on Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models. We evaluate the models on downstream tasks such as document classification and entity typing, and conduct intrinsic evaluations using Pseudo-Log-Likelihood. The results indicate that while further pre-trained models may show diminished robustness on general knowledge, they exhibit superior adaptability to domain-specific tasks, even in a zero-shot setting. Furthermore, applying calibration techniques and in-domain verbalizers significantly enhances the efficacy of encoder models. These domain-specialized models prove particularly advantageous in scenarios where in-domain resources or expertise are scarce. In conclusion, our findings offer new insights into the use of Italian models in specialized contexts, which may have a significant impact on both research and industrial applications in the digital transformation era.
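The Pseudo-Log-Likelihood (PLL) used here for intrinsic evaluation scores a sentence by masking each token in turn and summing the log-probabilities the encoder assigns to the true tokens. A minimal sketch of the aggregation step, with hypothetical per-token probabilities standing in for actual masked-LM outputs (the abstract does not specify an implementation):

```python
import math

def pseudo_log_likelihood(token_probs):
    """PLL of a sentence: sum over positions t of
    log P(w_t | sentence with w_t masked), where each probability
    comes from a masked-LM forward pass at the masked position."""
    return sum(math.log(p) for p in token_probs)

# Illustrative probabilities an MLM might assign to each true token
# when it is masked in turn (hypothetical values, not model output).
probs = [0.9, 0.5, 0.7]
score = pseudo_log_likelihood(probs)  # higher (less negative) = more fluent
```

In practice each probability requires a separate forward pass of the encoder with position t replaced by the mask token, so scoring a sentence costs one pass per token.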