In this paper, we investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift, i.e., can LLMs generalize in a zero-shot fashion, without additional in-domain training, to closed domains that require specialized knowledge such as medicine and law? To this end, we devise a series of experiments to empirically explain the performance gap. Our findings suggest that: a) LLMs struggle with the demands of closed-domain datasets, such as retrieving long answer spans; b) certain LLMs, despite strong overall performance, display weaknesses in meeting basic requirements, such as discriminating between domain-specific senses of words, which we link to pre-processing decisions; c) scaling model parameters is not always effective for cross-domain generalization; and d) closed-domain datasets differ quantitatively from open-domain EQA datasets in important respects, and current LLMs struggle to cope with these differences. Our findings point out important directions for improving existing LLMs.