Leveraging external knowledge to enhance reasoning ability is crucial for commonsense question answering. However, existing knowledge bases rely heavily on manual annotation, which unavoidably limits their coverage of commonsense knowledge. As a result, they are not flexible enough to support reasoning over diverse questions. Recently, large-scale language models (LLMs) have become dramatically better at capturing and leveraging knowledge, which opens up a new way to address this issue by eliciting knowledge from language models. We propose a Unified Facts Obtaining (UFO) approach. UFO turns LLMs into knowledge sources and produces relevant facts (knowledge statements) for a given question. We first develop a unified prompt consisting of demonstrations that cover different aspects of commonsense and different question styles. On this basis, we prompt the LLMs to generate question-related supporting facts for various commonsense questions. After fact generation, we apply a dense retrieval-based fact selection strategy to choose the best-matched fact. The selected facts are then fed into the answer inference model along with the question. Notably, owing to the design of the unified prompt, UFO can support reasoning across various aspects of commonsense (including general, scientific, and social commonsense). Extensive experiments on the CommonsenseQA 2.0, OpenBookQA, QASC, and Social IQA benchmarks show that UFO significantly improves the performance of the inference model and outperforms manually constructed knowledge sources.
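The prompt-then-select pipeline described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the authors' implementation: the "Question:/Fact:" prompt layout, the helper names `build_prompt`, `toy_embed`, and `select_fact`, and in particular the hashed bag-of-words embedding, which merely stands in for a trained dense retriever. The LLM call that would turn the prompt into candidate facts is left out.

```python
import math
import zlib
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def build_prompt(demonstrations: Sequence[Tuple[str, str]], question: str) -> str:
    """Join (question, fact) demonstrations, then append the new question.

    The "Question:/Fact:" layout is a hypothetical format, not the paper's.
    """
    blocks = [f"Question: {q}\nFact: {f}" for q, f in demonstrations]
    blocks.append(f"Question: {question}\nFact:")
    return "\n\n".join(blocks)


def toy_embed(text: str, dim: int = 64) -> Vector:
    """Stand-in for a dense encoder: a hashed bag-of-words count vector."""
    vec = [0.0] * dim
    for token in text.lower().split():
        token = token.strip(".,?!")
        if token:
            # crc32 is deterministic across runs, unlike the built-in hash().
            vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_fact(
    question: str,
    facts: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> str:
    """Return the candidate fact whose embedding is closest to the question's."""
    if not facts:
        raise ValueError("at least one candidate fact is required")
    query = embed(question)
    return max(facts, key=lambda fact: cosine(query, embed(fact)))
```

In use, the string returned by `build_prompt` would be sent to an LLM several times to sample candidate facts, and `select_fact` would then pick one of them to pass to the answer inference model together with the question. Swapping `toy_embed` for a real sentence encoder only requires passing a different `embed` callable.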