We measure the performance of in-context learning as a function of task novelty and difficulty for open and closed questions. For that purpose, we created a novel benchmark consisting of hard scientific questions, each paired with contexts of varying relevance. We show that, counter-intuitively, a context that is more aligned with the topic does not always help more than a less relevant context. This effect is especially visible for open questions and for questions of high difficulty or novelty. This result reveals a fundamental difference between how large language models treat closed-form and open-form questions, and shows the need for a more robust evaluation of in-context learning across a variety of question types. It also poses a new question: how to optimally select a context for large language models, especially in the setting of Retrieval Augmented Generation (RAG) systems. Our results suggest that the answer can be highly application-dependent and may be contingent on factors including the format of the question, its perceived difficulty, and the novelty or popularity of the information sought.