Modeling and analyzing long sequences of text is an essential task in Natural Language Processing. Success in capturing long-text dynamics with neural language models would facilitate many downstream tasks, such as coherence evaluation, text generation, and machine translation. This paper presents a novel approach to modeling sequences through a stochastic process. We introduce a likelihood-based training objective for the text encoder and design a more thorough measurement (score) for long-text evaluation than the previous approach. The proposed training objective effectively preserves sequence coherence, while the new score comprehensively captures both temporal and spatial dependencies. Theoretical properties of the new score demonstrate its advantages in sequence evaluation. Experimental results show superior performance on various sequence evaluation tasks, including global and local discrimination within and between documents of different lengths. We also demonstrate that the encoder achieves competitive results in discriminating human-written from AI-generated text.