Natural language understanding (NLU) with neural network pipelines often requires context beyond what the input data provides. Prior research has shown that neural models can game NLU benchmarks by exploiting statistical artifacts in the encoded external knowledge, artificially inflating performance metrics on downstream tasks. Our proposed approach, the Recap, Deliberate, and Respond (RDR) paradigm, addresses this issue by incorporating three distinct objectives within the neural network pipeline. First, the Recap objective paraphrases the input text with a paraphrasing model to summarize and encapsulate its essence. Second, the Deliberation objective encodes external graph information related to entities mentioned in the input text using a graph embedding model. Finally, the Respond objective employs a classification head that uses the representations from the Recap and Deliberation modules to generate the final prediction. By cascading these three models and minimizing a combined loss, we reduce the potential for gaming the benchmark and obtain a robust method for capturing the underlying semantic patterns, enabling accurate predictions. To evaluate RDR, we test it on multiple GLUE benchmark tasks. Our results show improved performance over competitive baselines, with gains of up to 2\% on standard metrics. We also analyze the evidence for semantic understanding exhibited by RDR models, emphasizing their ability to capture the underlying semantic patterns rather than game the benchmark.
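The cascade of three objectives and the combined loss can be illustrated with a minimal sketch. The abstract does not specify how the representations are fused or how the individual losses are weighted, so everything below is an assumption made for illustration: the function names (`respond`, `combined_loss`), the concatenation of the Recap and Deliberation representations, the linear classification head, and the weights `alpha`, `beta`, and `gamma` are hypothetical and are not taken from the RDR implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the gold label under softmax(logits)."""
    return -math.log(softmax(logits)[label])


def respond(
    recap_repr: Sequence[float],
    delib_repr: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Hypothetical Respond head: a linear layer over the concatenated
    Recap and Deliberation representations (fusion by concatenation is
    an assumption, not something the abstract states)."""
    features = list(recap_repr) + list(delib_repr)
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def combined_loss(
    recap_loss: float,
    delib_loss: float,
    respond_loss: float,
    alpha: float = 1.0,
    beta: float = 1.0,
    gamma: float = 1.0,
) -> float:
    """Assumed form of the combined objective: a weighted sum of the
    three per-module losses. The weights are illustrative only."""
    return alpha * recap_loss + beta * delib_loss + gamma * respond_loss


# Toy example with made-up numbers: 2-dimensional representations from
# each upstream module and a 2-class classification head.
recap_repr = [0.5, 0.1]   # stand-in for the paraphrasing model's output
delib_repr = [0.2, 0.9]   # stand-in for the graph embedding model's output
head_weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
head_bias = [0.0, 0.0]

logits = respond(recap_repr, delib_repr, head_weights, head_bias)
respond_loss = cross_entropy(logits, label=1)

# The Recap and Deliberation losses are placeholders here; in a real
# pipeline they would come from the paraphrasing and graph embedding models.
total = combined_loss(recap_loss=0.7, delib_loss=0.3, respond_loss=respond_loss)
print(f"logits={logits}, total loss={total:.4f}")
```

In a trainable version of this sketch, the single scalar returned by `combined_loss` would be minimized so that gradients from the classification objective also reach the two upstream modules, which is one plausible reading of "cascading these three models and minimizing a combined loss".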