Paraphrases are texts that convey the same meaning while using different words or sentence structures. Paraphrase generation can serve as an automatic data augmentation tool for many Natural Language Processing tasks, especially for low-resource languages, where data shortage is a significant problem. To generate paraphrases in multilingual settings, previous studies have leveraged knowledge from the machine translation field, i.e., forming a paraphrase through zero-shot machine translation within the same language. Despite good performance in human evaluation, those methods still require parallel translation datasets, which makes them inapplicable to languages that lack parallel corpora. To mitigate that problem, we propose the first unsupervised multilingual paraphrasing model, LAMPAT ($\textbf{L}$ow-rank $\textbf{A}$daptation for $\textbf{M}$ultilingual $\textbf{P}$araphrasing using $\textbf{A}$dversarial $\textbf{T}$raining), for which a monolingual dataset is sufficient to generate human-like and diverse sentences. In our experiments, we find that our method not only works well for English but also generalizes to unseen languages. Data and code are available at https://github.com/phkhanhtrinh23/LAMPAT.