Large Language Models (LLMs) have advanced considerably, and several claims have been made that they exceed human performance. Real-world tasks, however, often require domain knowledge. Low-resource learning methods such as Active Learning (AL) have been proposed to reduce the cost of domain-expert annotation, which raises the question: can LLMs surpass compact models trained with expert annotations on domain-specific tasks? In this work, we conduct an empirical study on four datasets from three different domains, comparing SOTA LLMs with small models trained on expert annotations with AL. We find that small models can outperform GPT-3.5 with a few hundred labeled examples, and that they achieve performance comparable to or higher than GPT-4 despite being hundreds of times smaller. Based on these findings, we posit that LLM predictions can be used as a warmup method in real-world applications, and that human experts remain indispensable in tasks involving data annotation driven by domain-specific knowledge.