A Generative Adversarial Network (GAN) is a model for data synthesis that creates plausible data through competition between a generator and a discriminator. Although the application of GANs to image synthesis has been studied extensively, GANs have inherent limitations in natural language generation. Because natural language is composed of discrete tokens, it is difficult to update the generator with gradients through backpropagation; therefore, most text-GAN studies generate sentences starting from a random token, guided by a reward system. As a result, the generators in previous studies are pre-trained autoregressively before adversarial training, which causes data memorization: the synthesized sentences reproduce the training data. In this paper, we synthesize sentences using a framework similar to the original GAN. More specifically, we propose Text Embedding Space Generative Adversarial Networks (TESGAN), which generate continuous text embedding spaces instead of discrete tokens to solve the gradient backpropagation problem. Furthermore, TESGAN is trained in an unsupervised manner that does not directly refer to the text of the training data, which overcomes the data memorization issue. By adopting this novel method, TESGAN can synthesize new sentences, showing the potential of unsupervised learning for text synthesis. We expect future research to combine Large Language Models with this new perspective of viewing text as a continuous space.
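The gradient problem described above can be illustrated with a small toy sketch. It is not the TESGAN architecture: the embedding table, the linear scorer standing in for a discriminator, and the softmax-weighted mixing used for the continuous path are all hypothetical choices made only for illustration. The sketch estimates gradients by finite differences and compares a path that selects a discrete token with one that stays in a continuous embedding space.

```python
import math

# Hypothetical toy embedding table: 3 tokens, 2 dimensions each.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
# Hypothetical linear scorer standing in for a discriminator.
CRITIC = [0.5, -2.0]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(embedding):
    return sum(w * e for w, e in zip(CRITIC, embedding))


def discrete_path(logits):
    # Select a single token id with argmax, then look up its embedding.
    token = max(range(len(logits)), key=lambda i: logits[i])
    return score(EMBEDDINGS[token])


def continuous_path(logits):
    # Stay in embedding space: probability-weighted mix of all embeddings.
    # (An illustrative stand-in, not the method proposed in the paper.)
    probs = softmax(logits)
    dims = len(EMBEDDINGS[0])
    mixed = [sum(p * EMBEDDINGS[i][d] for i, p in enumerate(probs))
             for d in range(dims)]
    return score(mixed)


def finite_diff_grad(fn, logits, eps=1e-4):
    # Central finite differences with respect to each logit.
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((fn(up) - fn(down)) / (2 * eps))
    return grads


logits = [2.0, 1.0, 0.1]
grad_discrete = finite_diff_grad(discrete_path, logits)
grad_continuous = finite_diff_grad(continuous_path, logits)
print("discrete path gradient:  ", grad_discrete)
print("continuous path gradient:", grad_continuous)
```

In this toy setting, small changes to the logits do not change which token argmax selects, so the discrete path gives the generator no learning signal, while the continuous path yields a nonzero gradient for every logit.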