Question answering plays a pivotal role in daily life because it underlies how we acquire knowledge about the world. However, because real-world facts change over time, the answer to a question can be completely different when its time constraint changes. Recently, Large Language Models (LLMs) have shown remarkable ability in question answering, yet our experiments reveal that such time-sensitive questions still pose a significant challenge to existing LLMs. We attribute this to LLMs' difficulty in reasoning rigorously when they rely only on surface-level text semantics. To overcome this limitation, rather than requiring LLMs to answer the question directly, we propose a novel approach that reframes the $\textbf{Q}$uestion $\textbf{A}$nswering task $\textbf{a}$s $\textbf{P}$rogramming ($\textbf{QAaP}$). Concretely, by leveraging modern LLMs' strong capability in understanding both natural language and programming languages, we use LLMs to represent diversely expressed text as well-structured code and to select the best-matching answer from multiple candidates through programming. We evaluate our QAaP framework on several time-sensitive question answering datasets and achieve improvements of up to $14.5$% over strong baselines. Our code and data are available at https://github.com/TianHongZXY/qaap.
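To make the idea of answering through programming concrete, the following is a minimal sketch of the selection step only. It assumes that the time constraint of the question and the candidate answers have already been written as structured objects; in QAaP that representation is produced by an LLM, which is not modelled here. The names `Interval`, `Candidate`, `overlap_years`, `is_well_formed`, and `best_match` are hypothetical, the scoring rule (number of overlapping years) is an illustrative assumption rather than the authors' method, and the data are invented.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Interval:
    """A closed span of years, e.g. Interval(2010, 2012)."""
    start: int
    end: int


@dataclass(frozen=True)
class Candidate:
    """A candidate answer together with the period during which it held."""
    answer: str
    span: Interval


def overlap_years(a: Interval, b: Interval) -> int:
    """Number of years shared by two closed intervals (0 if disjoint)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def is_well_formed(c: Candidate) -> bool:
    """Reject empty answers and intervals that end before they start."""
    return bool(c.answer.strip()) and c.span.start <= c.span.end


def best_match(query: Interval, candidates: List[Candidate]) -> Optional[str]:
    """Return the well-formed candidate overlapping the query the most.

    Returns None when no candidate overlaps the query at all.
    Ties are resolved in favour of the earliest candidate in the list.
    """
    scored = [
        (overlap_years(query, c.span), c)
        for c in candidates
        if is_well_formed(c)
    ]
    scored = [(score, c) for score, c in scored if score > 0]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1].answer


# Invented example: "Which team did the player belong to from 2010 to 2012?"
candidates = [
    Candidate("Team A", Interval(2005, 2009)),
    Candidate("Team B", Interval(2010, 2013)),
    Candidate("Team C", Interval(2014, 2016)),
    Candidate("", Interval(2010, 2012)),        # empty answer, filtered out
    Candidate("Team D", Interval(2012, 2010)),  # reversed interval, filtered out
]

print(best_match(Interval(2010, 2012), candidates))  # prints Team B
print(best_match(Interval(2015, 2015), candidates))  # prints Team C
```

Changing only the query interval changes the selected answer, which mirrors the observation that the answer depends on the time constraint in the question. Because the comparison is done by code rather than by reading the text, the malformed candidates are discarded deterministically before any matching takes place.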