Towards More Robust Natural Language Understanding

Natural Language Understanding (NLU) is a branch of Natural Language Processing (NLP) that uses intelligent computer software to understand texts that encode human knowledge. Recent years have witnessed notable progress across various NLU tasks with deep learning techniques, especially with pretrained language models. Besides proposing more advanced model architectures, constructing more reliable and trustworthy datasets also plays a huge role in improving NLU systems, without which it would be impossible to train a decent NLU model. It's worth noting that the human ability of understanding natural language is flexible and robust. On the contrary, most of existing NLU systems fail to achieve desirable performance on out-of-domain data or struggle on handling challenging items (e.g., inherently ambiguous items, adversarial items) in the real world. Therefore, in order to have NLU models understand human language more effectively, it is expected to prioritize the study on robust natural language understanding. In this thesis, we deem that NLU systems are consisting of two components: NLU models and NLU datasets. As such, we argue that, to achieve robust NLU, the model architecture/training and the dataset are equally important. Specifically, we will focus on three NLU tasks to illustrate the robustness problem in different NLU tasks and our contributions (i.e., novel models and new datasets) to help achieve more robust natural language understanding. Moving forward, the ultimate goal for robust natural language understanding is to build NLU models which can behave humanly. That is, it's expected that robust NLU systems are capable to transfer the knowledge from training corpus to unseen documents more reliably and survive when encountering challenging items even if the system doesn't know a priori of users' inputs.

翻译：自然语言理解(NLU)是自然语言处理(NLP)的一个分支,它使用智能计算机软件来理解将人类知识编码的文本。近年来,在各种自然语言理解(NLU)任务中,以深层次的学习技术,特别是语言模型,取得了显著的进展。除了提出更先进的模型结构外,建设更可靠和值得信赖的数据集在改进自然语言系统方面也发挥着巨大的作用,如果没有这种模型,就不可能培训一个像样的自然语言理解模式。值得指出的是,人类理解自然语言的能力是灵活和强大的。相反,现有的非自然语言系统大多数无法在外部数据中取得理想的性能,或者在处理现实世界中具有挑战性的项目(例如,内在的模糊性项目、对抗性项目)上挣扎。因此,为了让自然语言模型模型能够更有效地理解人类语言,预计研究将优先进行稳健的自然语言理解。在这个理论中,NLU系统由两个组成部分组成:NLU模型知道具有挑战性的语言模型和NLU数据集。因此,我们说,为了实现稳健的NLU, 将使得模型的自然学习/训练成为我们之前的自然目标中的重要任务。