Large language models (LLMs) have demonstrated impressive zero-shot performance on NLP tasks thanks to the knowledge acquired during training. In multiple-choice QA tasks, LM probabilities serve as an imperfect measure of the plausibility of each answer choice. A major limitation of this basic score is that it treats all words as equally important. We propose CASE, a Commonsense-Augmented Score with an Expanded Answer Space. CASE addresses this limitation by assigning importance weights to individual words based on their semantic relations to other words in the input. This dynamic weighting approach outperforms basic LM scores, not only because it reduces noise from unimportant words, but also because it informs the model of implicit commonsense knowledge that may be useful for answering the question. Following prior work, we also expand the answer space by generating lexically divergent answers that are conceptually similar to the choices. When combined with answer space expansion, our method outperforms strong baselines on five commonsense benchmarks. We further show that the two approaches are complementary and may be especially beneficial when using smaller LMs.
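To make the contrast between a basic LM score and a word-weighted score concrete, the following is a minimal sketch and not the CASE implementation. It assumes that per-token log-probabilities and per-token importance weights are already available; the abstract does not give the exact weighting formula, so the weighted mean below, the max aggregation over expanded answers, and all function names and toy numbers are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple

# One scored answer string: (token log-probs, token importance weights).
Variant = Tuple[Sequence[float], Sequence[float]]


def basic_score(token_logprobs: Sequence[float]) -> float:
    """Unweighted baseline: mean token log-probability (every word counts equally)."""
    return sum(token_logprobs) / len(token_logprobs)


def weighted_score(token_logprobs: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of token log-probabilities.

    Assumption: weights are non-negative importance values, one per token.
    How CASE derives them from semantic relations is not reproduced here.
    """
    if len(token_logprobs) != len(weights):
        raise ValueError("need exactly one weight per token")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * lp for w, lp in zip(weights, token_logprobs)) / total


def pick_answer(expanded: Dict[str, List[Variant]]) -> str:
    """Pick the choice whose best-scoring variant has the highest weighted score.

    `expanded` maps each original choice to the choice itself plus any generated
    paraphrases. Taking the max over variants is an assumed aggregation rule.
    """
    return max(
        expanded,
        key=lambda choice: max(weighted_score(lp, w) for lp, w in expanded[choice]),
    )


# Toy numbers (hypothetical, not taken from any model).
logprobs_a, logprobs_b = [-1.0, -5.0], [-2.0, -3.0]
uniform = [1.0, 1.0]

# With equal weights, choice B looks more plausible than choice A.
print(basic_score(logprobs_a), basic_score(logprobs_b))

# Up-weighting the first, more informative token of A flips the decision.
choices = {"A": [(logprobs_a, [3.0, 1.0])], "B": [(logprobs_b, uniform)]}
print(pick_answer(choices))
```

With uniform weights the weighted score reduces to the basic score, so the baseline is a special case of the same function. In practice the log-probabilities would come from an LM and the variants from a paraphrase generator, both of which are outside the scope of this sketch.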