Paraphrase generation is a pivotal task in natural language processing (NLP). Existing paraphrase datasets lack syntactic and lexical diversity, so their paraphrases closely resemble the source sentences. Moreover, these datasets often contain hate speech and noise, and may unintentionally include non-English sentences. This research introduces ParaFusion, a large-scale, high-quality English paraphrase dataset developed using large language models (LLMs) to address these challenges. ParaFusion augments existing datasets with high-quality data, significantly enhancing both lexical and syntactic diversity while maintaining close semantic similarity. It also mitigates the presence of hate speech and reduces noise, ensuring a cleaner and more focused English dataset. Results show that ParaFusion offers at least a 25% improvement in both syntactic and lexical diversity, measured across several metrics for each data source. The paper also aims to set a gold standard for paraphrase evaluation by presenting one of the most comprehensive evaluation strategies to date. The results underscore the potential of ParaFusion as a valuable resource for improving NLP applications.