ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. To relax the dependency on labeled data of downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which can deal with multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared common latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder to learn to generate text by reconstructing the input text given its coordinate in the shared latent space. Consequently, during inference, based on the data-to-text pipeline, ZeroNLG can generate target sentences across different languages given the coordinate of input data in the common space. Within this unified framework, given visual (image or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods.
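
The three training steps and the inference path described above can be illustrated with a toy sketch. This is not the ZeroNLG implementation: the linear projections, the squared-distance alignment loss, the single-token "decoder", and all names and sizes (`project`, `alignment_loss`, `reconstruction_loss`, `generate`, `LATENT_DIM`, `VOCAB_SIZE`) are assumptions chosen only to make the idea concrete. It also assumes that some source of corresponding inputs across domains is available for the alignment step, which the abstract does not specify.

```python
import numpy as np

LATENT_DIM = 8   # size of the shared latent space (arbitrary toy value)
VOCAB_SIZE = 5   # toy vocabulary size (arbitrary toy value)


def l2_normalize(x):
    """Scale each row to unit length so coordinates are comparable."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def project(features, weight):
    """Step (i): map domain-specific features to coordinates in the shared space."""
    return l2_normalize(features @ weight)


def alignment_loss(coords_a, coords_b):
    """Step (ii): mean squared distance between corresponding coordinates."""
    return float(np.mean(np.sum((coords_a - coords_b) ** 2, axis=-1)))


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def reconstruction_loss(coords, decoder_weight, token_ids):
    """Step (iii): cross-entropy for reconstructing text from its own coordinate.

    A real decoder generates token sequences; one token per input keeps the toy small.
    """
    log_probs = log_softmax(coords @ decoder_weight)
    return float(-np.mean(log_probs[np.arange(len(token_ids)), token_ids]))


def generate(coords, decoder_weight):
    """Inference: decode the most likely token from any coordinate in the shared space."""
    return np.argmax(coords @ decoder_weight, axis=-1)


# Hypothetical toy data: 4 image feature vectors and 4 corresponding text feature vectors.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 16))
text_feats = rng.normal(size=(4, 12))
token_ids = np.array([0, 1, 2, 3])

# Randomly initialised parameters; a real system would learn these by gradient descent.
w_image = rng.normal(size=(16, LATENT_DIM))
w_text = rng.normal(size=(12, LATENT_DIM))
w_decoder = rng.normal(size=(LATENT_DIM, VOCAB_SIZE))

image_coords = project(image_feats, w_image)
text_coords = project(text_feats, w_text)

# Training objective: align the two domains and auto-encode the text.
train_loss = (alignment_loss(image_coords, text_coords)
              + reconstruction_loss(text_coords, w_decoder, token_ids))

# Zero-shot inference: the decoder was only ever fit on text coordinates,
# but it can be applied to image coordinates because both live in the same space.
predicted = generate(image_coords, w_decoder)
```

The point of the sketch is the division of labour: the decoder only sees text coordinates during training, so the quality of generation from visual input depends on how well the alignment term brings the domains together in the shared space.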
