Discourse analysis is an important task because it models the intrinsic semantic structure between sentences in a document. Discourse markers are natural representations of discourse in everyday language. One challenge is that both the markers and the pre-defined, human-labeled discourse relations can be ambiguous when describing the semantics between sentences. We believe that a better approach is to use a context-dependent distribution over the markers to express discourse information. In this work, we propose to learn a Distributed Marker Representation (DMR) by utilizing the (potentially) unlimited discourse marker data with a latent discourse sense, thereby bridging markers with sentence pairs. Such representations can be learned automatically from data without supervision, and in turn provide insights into the data itself. Experiments show that DMR achieves state-of-the-art performance on the implicit discourse relation recognition task and offers strong interpretability. Our method also provides a valuable tool for understanding the complex ambiguity and entanglement among discourse markers and manually defined discourse relations.
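As a rough illustration of what a context-dependent distribution over markers with a latent discourse sense could look like, the sketch below marginalizes over a small set of latent senses: p(marker | pair) = Σ_z p(z | pair) · p(marker | z). This factorization is an assumption made for illustration; the abstract does not spell out the model. The function names, the marker list, and all scores are hypothetical toy values; in practice the sense scores would come from a sentence-pair encoder.

```python
import math


def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marker_distribution(sense_scores, marker_scores_per_sense):
    """Marginalize over latent senses z:
    p(m | pair) = sum_z p(z | pair) * p(m | z).

    sense_scores: one score per latent sense for a given sentence pair
                  (assumed to come from some encoder; hypothetical here).
    marker_scores_per_sense: one row of marker scores per latent sense.
    """
    p_z = softmax(sense_scores)
    p_m_given_z = [softmax(row) for row in marker_scores_per_sense]
    n_markers = len(marker_scores_per_sense[0])
    return [
        sum(p_z[z] * p_m_given_z[z][m] for z in range(len(p_z)))
        for m in range(n_markers)
    ]


# Hypothetical toy setup: three markers, two latent senses.
MARKERS = ["because", "but", "then"]
sense_scores = [2.0, 0.5]
marker_scores_per_sense = [
    [3.0, 0.0, 1.0],  # sense 0 favours "because"
    [0.0, 3.0, 0.5],  # sense 1 favours "but"
]

dist = marker_distribution(sense_scores, marker_scores_per_sense)
for marker, prob in zip(MARKERS, dist):
    print(f"{marker}: {prob:.3f}")
```

The output is a full distribution over markers rather than a single label, which is one way to keep the ambiguity between markers visible instead of forcing a hard choice.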