Language enables humans to share knowledge, reason about the world, and pass on strategies for survival and innovation across generations. At the heart of this process lies not just the ability to communicate but the remarkable flexibility with which we express ourselves. We can express the same thought in virtually infinite ways using different words and structures; this ability to rephrase and reformulate expressions is known as paraphrasing. Modeling paraphrases is a cornerstone of meaning in computational language models: the ability to construct variations of a text that do or do not preserve its meaning demonstrates strong semantic understanding. If computational language models are to represent meaning, they must understand and control, at a fine granularity, the aspects that preserve meaning as opposed to those that change it. Yet most existing approaches reduce paraphrasing to a binary decision between two texts or to producing a single rewrite of a source, obscuring which linguistic factors are responsible for meaning preservation. In this thesis, I propose that decomposing paraphrases into their constituent linguistic aspects (paraphrase types) offers a more fine-grained and cognitively grounded view of semantic equivalence. I show that even advanced machine learning models struggle with this task. Yet, when explicitly trained on paraphrase types, models achieve stronger performance on related paraphrase tasks and downstream applications. For example, in plagiarism detection, language models trained on paraphrase types surpass human baselines: 89.6% accuracy compared to 78.4% for plagiarism cases from Wikipedia, and 66.5% compared to 55.7% for plagiarism of scientific papers from arXiv. In identifying duplicate questions on Quora, models trained with paraphrase types outperform models trained on binary pairs. Furthermore, I demonstrate that...