ZeroNLG: Aligning and Autoencoding Domains for Zero-Shot Multimodal and Multilingual Natural Language Generation

Natural Language Generation (NLG) accepts input data in the form of images, videos, or text and generates corresponding natural language text as output. Existing NLG methods mainly adopt a supervised approach and rely heavily on coupled data-to-text pairs. However, for many targeted scenarios and for non-English languages, sufficient quantities of labeled data are often not available. To relax the dependency on labeled data for downstream tasks, we propose an intuitive and effective zero-shot learning framework, ZeroNLG, which handles multiple NLG tasks, including image-to-text (image captioning), video-to-text (video captioning), and text-to-text (neural machine translation), across English, Chinese, German, and French within a unified framework. ZeroNLG does not require any labeled downstream pairs for training. During training, ZeroNLG (i) projects different domains (across modalities and languages) to corresponding coordinates in a shared latent space; (ii) bridges different domains by aligning their corresponding coordinates in this space; and (iii) builds an unsupervised multilingual auto-encoder that learns to generate text by reconstructing the input text given its coordinate in the shared latent space. Consequently, during inference, following the data-to-text pipeline, ZeroNLG can generate target sentences in different languages given the coordinate of the input data in the shared space. Within this unified framework, given visual (image or video) data as input, ZeroNLG can perform zero-shot visual captioning; given textual sentences as input, ZeroNLG can perform zero-shot machine translation. We present the results of extensive experiments on twelve NLG tasks, showing that, without using any labeled downstream pairs for training, ZeroNLG generates high-quality and believable outputs and significantly outperforms existing zero-shot methods.
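
The three training steps and the inference path described above can be illustrated with a deliberately tiny sketch. This is not the ZeroNLG implementation: the 2-D coordinates, the sentence pairing used for alignment, the interpolation update in `align_step`, and the nearest-neighbour `ToyDecoder` (a stand-in for a learned auto-encoder decoder) are all assumptions made for illustration, since the abstract does not specify encoders, losses, or training data.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Toy alignment objective for one pair of coordinates (smaller is better)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def align_step(coord: Vector, anchor: Vector, rate: float = 0.5) -> Vector:
    """Move `coord` a fraction `rate` of the way toward `anchor`."""
    return [c + rate * (t - c) for c, t in zip(coord, anchor)]


class ToyDecoder:
    """Hypothetical stand-in for the auto-encoder's decoder.

    It memorises text-only (coordinate, sentence) pairs per language and
    returns the sentence whose stored coordinate is nearest to the query.
    """

    def __init__(self) -> None:
        self.memory: Dict[str, List[Tuple[Vector, str]]] = {}

    def fit(self, lang: str, sentence: str, coord: Vector) -> None:
        self.memory.setdefault(lang, []).append((coord, sentence))

    def generate(self, lang: str, coord: Vector) -> str:
        nearest = min(self.memory[lang],
                      key=lambda item: squared_distance(item[0], coord))
        return nearest[1]


def run_demo() -> Tuple[str, str]:
    # (i) Project: made-up coordinates in a 2-D shared space.
    english = {"a dog runs": [1.0, 0.0], "a cat sits": [0.0, 1.0]}
    # The German coordinates start out misaligned with their English counterparts.
    german = {"ein Hund rennt": [0.4, 0.6], "eine Katze sitzt": [0.6, 0.4]}
    # Assumed correspondences used only to drive the toy alignment.
    pairs = [("ein Hund rennt", "a dog runs"), ("eine Katze sitzt", "a cat sits")]

    # (ii) Align: pull each German coordinate toward its English anchor.
    for _ in range(5):
        for de, en in pairs:
            german[de] = align_step(german[de], english[en])

    # (iii) Autoencode: the decoder only ever sees text and its own coordinate.
    decoder = ToyDecoder()
    for sentence, coord in english.items():
        decoder.fit("en", sentence, coord)
    for sentence, coord in german.items():
        decoder.fit("de", sentence, coord)

    # Inference: any input that lands in the shared space can be decoded.
    image_coord = [0.9, 0.1]  # hypothetical coordinate of a dog photo
    caption = decoder.generate("de", image_coord)  # captioning in German
    translation = decoder.generate("en", german["eine Katze sitzt"])  # de -> en
    return caption, translation


if __name__ == "__main__":
    print(run_demo())
```

In this toy setting the decoder is never shown an image-caption or German-English pair directly; it only maps coordinates back to sentences. Generation from an image or from a sentence in another language works only because step (ii) has already placed corresponding inputs at nearby coordinates.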
