Generative models have gained significant prominence in Natural Language Processing (NLP), especially for the complex task of modeling and evaluating long text sequences. This task is crucial for advancing downstream applications such as text generation and machine translation. Recent methods that use stochastic processes to capture the intrinsic dynamics of sequences have shown superior performance in generative modeling. However, accurately encoding both temporal and structural dependencies from text datasets, and leveraging this encoded information for sequence evaluation, remain open research problems. In this paper, we propose a novel approach to learning the stochastic dynamics of long text sequences, using a negative log-likelihood-based encoder that outperforms contrastive learning methods. We also introduce a likelihood-based evaluation metric for long-text assessment, which measures sequence coherence and can be applied to downstream tasks such as human-AI text discrimination. Our encoder preserves sequence coherence effectively and performs robustly on out-of-domain datasets, and the proposed evaluation metric comprehensively captures both temporal and structural information. Theoretical analysis demonstrates the superiority of our metric for sequence evaluation, and experimental results highlight its flexibility and strong performance across a variety of tasks, showcasing its utility in diverse NLP applications.