KEPR: Knowledge Enhancement and Plausibility Ranking for Generative Commonsense Question Answering

Generative commonsense question answering (GenCQA) is the task of automatically generating a list of answers to a given question. The answer list is required to cover all reasonable answers, which poses two considerable challenges: producing diverse answers and ranking them properly. Incorporating a variety of closely related background knowledge into the encoding of questions enables the generation of different answers. Meanwhile, learning to distinguish positive answers from negative ones potentially enhances the probabilistic estimation of plausibility and, accordingly, the plausibility-based ranking. We therefore propose a Knowledge Enhancement and Plausibility Ranking (KEPR) approach built on the Generate-Then-Rank pipeline architecture. Specifically, we expand questions with Wiktionary commonsense knowledge about their keywords and reformulate them with normalized patterns. Dense passage retrieval is used to capture relevant knowledge, and different PLM-based networks (BART, GPT2 and T5) are used to generate answers. For ranking, we develop an ELECTRA-based answer ranking model trained with logistic regression, with the aim of approximating different levels of plausibility in a polar classification scenario. Extensive experiments on the ProtoQA benchmark show that KEPR obtains substantial improvements over strong baselines. Among the experimental models, the T5-based GenCQA with KEPR performs best, reaching up to 60.91% on the primary canonical metric Inc@3, and outperforms the existing GenCQA models on the current ProtoQA leaderboard.
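
The plausibility-ranking step can be illustrated with a minimal sketch. It is not the authors' implementation: KEPR ranks with an ELECTRA-based model, whereas here two hand-made numeric features stand in for the encoder output so that the example stays self-contained. All names (`train_logistic`, `rank_answers`) and the toy data are hypothetical. The sketch only shows the general idea of fitting a logistic regression on positive versus negative answers and then sorting generated candidates by the predicted probability.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    """Plausibility estimate for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_logistic(features, labels, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on binary cross-entropy."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = score(x, w, b) - y  # derivative of the loss w.r.t. the logit
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rank_answers(candidates, w, b):
    """Sort (answer, features) pairs by descending plausibility."""
    scored = [(answer, score(x, w, b)) for answer, x in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy training set (assumption): label 1 = positive answer, 0 = negative answer.
train_x = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 0.9]]
train_y = [1, 1, 0, 0]
w, b = train_logistic(train_x, train_y)

# Hypothetical generated candidates for "Name something you take out in the rain."
candidates = [
    ("sandals", [0.15, 0.85]),
    ("umbrella", [0.95, 0.10]),
    ("raincoat", [0.70, 0.30]),
]
ranked = rank_answers(candidates, w, b)
for answer, plausibility in ranked:
    print(f"{answer}: {plausibility:.3f}")
```

Although training uses only binary (polar) labels, the sigmoid output is continuous, so candidates can be ordered by graded plausibility rather than merely accepted or rejected.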
