Large language models are becoming the go-to solution for many natural language processing tasks, including in specialized domains where their few-shot capacities are expected to yield high performance in low-resource settings. We assess the performance of large language models for few-shot clinical entity recognition in multiple languages. We evaluate named entity recognition in English, French and Spanish using 8 in-domain (clinical) and 6 out-of-domain gold-standard corpora. We compare 10 auto-regressive language models used with prompting against 16 masked language models used for text encoding in a biLSTM-CRF supervised tagger. We create a few-shot setup by limiting the amount of annotated data available to 100 sentences. Our experiments show that although larger prompt-based models tend to achieve competitive F-measure for named entity recognition outside the clinical domain, this level of performance does not carry over to the clinical domain, where lighter supervised taggers relying on masked language models perform better, even after the performance drop caused by the few-shot setup. In all experiments, the CO2 impact of masked language models is lower than that of auto-regressive models. Results are consistent across the three languages and suggest that few-shot learning using large language models is not production-ready for named entity recognition in the clinical domain. Instead, these models could be used to speed up the production of gold-standard annotated data.
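As an illustration of the few-shot setup and the F-measure mentioned in the abstract, the following is a minimal sketch rather than the authors' actual pipeline. It assumes uniform random sampling of the 100 sentences and exact-match, entity-level scoring; the abstract specifies neither. The function names `sample_few_shot` and `entity_f1`, and the tuple layout for entities, are hypothetical.

```python
import random


def sample_few_shot(sentences, k=100, seed=0):
    """Draw k sentences without replacement (assumed uniform sampling).

    Returns all sentences if the corpus has k or fewer.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    if len(sentences) <= k:
        return list(sentences)
    return rng.sample(sentences, k)


def entity_f1(gold, pred):
    """Exact-match entity-level precision, recall and F1 (an assumption).

    gold, pred: iterables of (sentence_id, start, end, label) tuples.
    A prediction counts as correct only if all four fields match.
    """
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # true positives: entities found in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(1000)]  # toy stand-in corpus
    train = sample_few_shot(corpus, k=100)
    print(len(train))

    # Toy annotations with hypothetical labels, for illustration only.
    gold = {(0, 0, 2, "DISO"), (0, 5, 6, "CHEM"), (1, 1, 3, "DISO")}
    pred = {(0, 0, 2, "DISO"), (1, 1, 3, "CHEM")}
    print(entity_f1(gold, pred))
```

Under this strict matching, a prediction with the right span but the wrong label counts as both a false positive and a false negative; a relaxed (overlap-based) scheme would give different scores.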