Generative Adversarial Networks (GANs) are a family of models for data synthesis that create plausible data through competition between a generator and a discriminator. Although the application of GANs to image synthesis has been studied extensively, GANs have inherent limitations for natural language generation. Because natural language is composed of discrete tokens, gradients from the discriminator cannot easily be backpropagated to the generator; therefore, most text-GAN studies generate sentences starting from a random token, based on a reward system. As a result, the generators in previous studies are pre-trained autoregressively before adversarial training, which causes data memorization: the synthesized sentences reproduce the training data. In this paper, we synthesize sentences using a framework similar to the original GAN. More specifically, we propose Text Embedding Space Generative Adversarial Networks (TESGAN), which generate continuous text embedding spaces instead of discrete tokens to solve the gradient backpropagation problem. Furthermore, TESGAN conducts unsupervised learning that does not directly refer to the text of the training data, in order to overcome the data memorization issue. By adopting this novel method, TESGAN can synthesize new sentences, showing the potential of unsupervised learning for text synthesis. We expect to see extended research combining Large Language Models with a new perspective of viewing text as a continuous space.
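The gradient problem described above can be illustrated with a minimal sketch. This is not the TESGAN architecture: the three-token vocabulary, the two-dimensional embeddings, the linear stand-in for a discriminator, and all function names are hypothetical choices made only for illustration. The sketch compares a discrete path (pick the argmax token, then look up its embedding) with a continuous path (stay in embedding space via a softmax-weighted mixture) and estimates the gradient of the score with respect to one generator logit by finite differences.

```python
import math

# Hypothetical toy setup (not from the paper): 3 tokens, 2-d embeddings.
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
CRITIC_WEIGHTS = [0.5, -2.0]  # stand-in linear "discriminator"


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def critic(vec):
    """Linear score of an embedding vector."""
    return sum(w * v for w, v in zip(CRITIC_WEIGHTS, vec))


def score_discrete(logits):
    """Discrete path: choose the argmax token, then embed it."""
    token = max(range(len(logits)), key=lambda i: logits[i])
    return critic(EMBEDDINGS[token])


def score_continuous(logits):
    """Continuous path: softmax-weighted mixture of embeddings."""
    probs = softmax(logits)
    dim = len(EMBEDDINGS[0])
    mixed = [sum(p * EMBEDDINGS[i][d] for i, p in enumerate(probs))
             for d in range(dim)]
    return critic(mixed)


def finite_difference(fn, logits, index, eps=1e-4):
    """Forward-difference estimate of d fn / d logits[index]."""
    bumped = list(logits)
    bumped[index] += eps
    return (fn(bumped) - fn(logits)) / eps


logits = [2.0, 1.0, 0.5]
grad_discrete = finite_difference(score_discrete, logits, 1)
grad_continuous = finite_difference(score_continuous, logits, 1)

print("discrete path gradient estimate:  ", grad_discrete)
print("continuous path gradient estimate:", grad_continuous)
```

In this toy setting a small change to a logit does not change the selected token, so the discrete path gives the generator no learning signal, whereas the continuous path responds smoothly. Operating on continuous representations is the general idea behind generating embedding spaces rather than tokens; the mixture used here is only one simple way to stay continuous and is an assumption of this sketch.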