Disentangling the encodings of neural models is fundamental to improving interpretability, semantic control, and downstream task performance in Natural Language Processing. Currently, most disentanglement methods are unsupervised or rely on synthetic datasets with known generative factors. We argue that recurrent syntactic and semantic regularities in textual data can be used to provide models with both structural biases and generative factors. We leverage the semantic structures present in definitional sentences, a representative and semantically dense category of sentence types, to train a Variational Autoencoder that learns disentangled representations. Our experimental results show that the proposed model outperforms unsupervised baselines on several qualitative and quantitative benchmarks for disentanglement, and that it also improves results on the downstream task of definition modeling.