In this paper, we investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift, i.e., can LLMs generalize to domains that require specific knowledge, such as medicine and law, in a zero-shot fashion without additional in-domain training? To this end, we devise a series of experiments to explain the performance gap empirically. Our findings suggest that: (a) LLMs struggle with the demands of closed-domain datasets, such as retrieving long answer spans; (b) certain LLMs, despite showing strong overall performance, display weaknesses in meeting basic requirements, such as discriminating between domain-specific senses of words, which we link to pre-processing decisions; (c) scaling model parameters is not always effective for cross-domain generalization; and (d) closed-domain datasets are quantitatively much different from open-domain EQA datasets, and current LLMs struggle to deal with them. Our findings point out important directions for improving existing LLMs.