While large language models (LLMs) like GPT-3 have achieved impressive results on multiple choice question answering (MCQA) tasks in the zero-, one-, and few-shot settings, they generally lag behind the MCQA state of the art (SOTA). MCQA tasks have traditionally been presented to LLMs as cloze tasks: an LLM is conditioned on a question (without the associated answer options), and its chosen option is the one assigned the highest probability after normalization (for length, etc.). A more natural prompting approach is to present the question and answer options to the LLM jointly and have it output the symbol (e.g., "A") associated with its chosen answer option. This approach allows the model to explicitly compare answer options, reduces computational costs, and mitigates the effects of tokenization scheme and answer option representation on answer selection. For the natural approach to be effective, the LLM must be able to associate answer options with the symbols that represent them, an ability we term multiple choice symbol binding (MCSB). This ability varies greatly by model. We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated.
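The contrast between the two prompting styles can be made concrete with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the prompt templates, the per-character length normalization, and the two scorer callables (`continuation_logprob` and `next_token_logprobs`) are hypothetical stand-ins for whatever LLM interface is actually used.

```python
import string
from typing import Callable, Mapping, Sequence

# Hypothetical interfaces (assumptions, not a real library API):
#   ContinuationScorer(prompt, continuation) -> log P(continuation | prompt)
#   NextTokenScorer(prompt) -> {token: log P(token | prompt)} for the next token
ContinuationScorer = Callable[[str, str], float]
NextTokenScorer = Callable[[str], Mapping[str, float]]


def cloze_choice(question: str, options: Sequence[str],
                 continuation_logprob: ContinuationScorer) -> int:
    """Cloze-style selection: the model never sees the options together.

    Each option is scored separately as a continuation of the question,
    so this needs one scoring call per option. Scores are normalized by
    character length here; other normalizations are possible.
    """
    prompt = f"Question: {question}\nAnswer:"
    scores = [
        continuation_logprob(prompt, " " + opt) / max(len(opt), 1)
        for opt in options
    ]
    return max(range(len(options)), key=scores.__getitem__)


def natural_prompt(question: str, options: Sequence[str]) -> str:
    """Build a prompt that shows the question and all labeled options jointly."""
    symbols = string.ascii_uppercase[: len(options)]
    lines = [f"Question: {question}"]
    lines += [f"{sym}. {opt}" for sym, opt in zip(symbols, options)]
    lines.append("Answer:")
    return "\n".join(lines)


def natural_choice(question: str, options: Sequence[str],
                   next_token_logprobs: NextTokenScorer) -> int:
    """Natural selection: one scoring call, then compare the symbol tokens.

    Works only if the model binds each symbol to its option (MCSB).
    """
    logprobs = next_token_logprobs(natural_prompt(question, options))
    symbols = string.ascii_uppercase[: len(options)]
    scores = [logprobs.get(sym, float("-inf")) for sym in symbols]
    return max(range(len(options)), key=scores.__getitem__)
```

In this sketch the cloze variant issues one scoring call per answer option, while the natural variant issues a single call and compares only the symbol tokens, which is where the reduced cost and the reduced sensitivity to option tokenization would come from.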