Automatically generating text with desired attributes is an ambitious, long-pursued goal. Existing work has made steady progress in incorporating unimodal controls into language models (LMs), but how to generate controllable sentences efficiently from multimodal signals remains an open question. To address this problem, we propose a new paradigm for zero-shot controllable text generation with multimodal signals (\textsc{ZeroGen}). Specifically, \textsc{ZeroGen} applies text and image controls successively, from the token level to the sentence level, and maps them into a unified probability space at decoding time, customizing the LM outputs by weighted addition without any extra training. To achieve better inter-modal trade-offs, we further introduce an effective dynamic weighting mechanism that regulates all control weights. Moreover, we conduct extensive experiments to probe whether the relationship between signals from distinct modalities is in-depth or in-width. Encouraging empirical results on three downstream tasks show that \textsc{ZeroGen} not only outperforms its counterparts on captioning tasks by a large margin but also shows great potential in multimodal news generation with a higher degree of control. Our code will be released at https://github.com/ImKeTT/ZeroGen.
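To make the decoding-time idea concrete, the following is a minimal sketch of weighted addition in a shared probability space over a toy vocabulary. It is not the released implementation: the function names (`combine_controls`, `dynamic_weight`), the use of a softmax to turn raw control scores into distributions, and the overlap-based weighting rule are all assumptions made for illustration only. In particular, the dynamic weighting mechanism of the paper is not specified in the abstract, so the rule below is a hypothetical stand-in.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weight(base_weight, lm_probs, control_probs):
    """Hypothetical dynamic weighting rule (an assumption, not the paper's).

    Shrinks the base weight when the LM distribution already overlaps
    with the control distribution, and keeps it large when they disagree.
    """
    overlap = sum(min(p, c) for p, c in zip(lm_probs, control_probs))
    return base_weight * (1.0 - overlap)


def combine_controls(lm_logits, text_scores, image_scores,
                     w_text=1.0, w_image=1.0):
    """Weighted addition of control distributions to the LM distribution.

    All three inputs are per-token scores over the same vocabulary.
    Each is mapped to a probability distribution, the distributions are
    summed with the given weights, and the result is renormalized.
    """
    if not (len(lm_logits) == len(text_scores) == len(image_scores)):
        raise ValueError("all score lists must cover the same vocabulary")
    p_lm = softmax(lm_logits)
    p_text = softmax(text_scores)
    p_image = softmax(image_scores)
    mixed = [a + w_text * b + w_image * c
             for a, b, c in zip(p_lm, p_text, p_image)]
    total = sum(mixed)
    return [m / total for m in mixed]


if __name__ == "__main__":
    # Toy three-token vocabulary; all numbers are made up for illustration.
    lm_logits = [2.0, 1.0, 0.0]      # the LM prefers token 0
    text_scores = [0.0, 3.0, 0.0]    # the textual control prefers token 1
    image_scores = [0.0, 0.0, 0.0]   # the visual control is uninformative

    w_text = dynamic_weight(2.0, softmax(lm_logits), softmax(text_scores))
    probs = combine_controls(lm_logits, text_scores, image_scores,
                             w_text=w_text, w_image=0.0)
    print([round(p, 3) for p in probs])
```

Because the controls only reshape the next-token distribution at decoding time, no LM parameters are updated in this sketch, which is the sense in which such a scheme needs no extra training.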