Current approaches to paraphrase generation and detection rely heavily on a single general similarity score, ignoring the intricate linguistic properties of language. This paper introduces two new tasks that address this shortcoming by considering paraphrase types: specific linguistic perturbations at particular text positions. We name these tasks Paraphrase Type Generation and Paraphrase Type Detection. Our results suggest that while current techniques perform well in a binary classification scenario, i.e., paraphrased or not, the inclusion of fine-grained paraphrase types poses a significant challenge. Although most approaches are good at generating and detecting semantically similar content in general, they fail to understand the intrinsic linguistic variables they manipulate. Models trained to generate and identify paraphrase types also show improvements on tasks that do not involve them. In addition, scaling these models further improves their ability to understand paraphrase types. We believe paraphrase types can unlock a new paradigm for developing paraphrase models and solving tasks in the future.