Automatic summarization plays an important role in coping with the exponential growth of documents on the Web. On content websites such as CNN.com and WikiHow.com, the main document is often accompanied by various kinds of side information, such as videos, images, and queries, that attract attention and ease understanding. Such information can improve summarization, as it often explicitly or implicitly conveys the essence of the article. However, most existing side-aware summarization methods are designed to incorporate either single-modal or multi-modal side information, and a method built for one setting cannot effectively adapt to the other. In this paper, we propose a general summarization framework that can flexibly incorporate various modalities of side information. Designing such a model poses two main challenges: (1) the side information can be in textual or visual format, so the model needs to align and unify it with the document in the same semantic space; (2) the side inputs can contain information from various aspects, so the model should recognize the aspects useful for summarization. To address these two challenges, we first propose a unified topic encoder, which jointly discovers latent topics from the document and various kinds of side information. The learned topics flexibly bridge and guide the information flow between multiple inputs in a graph encoder through a topic-aware interaction. Second, we propose a triplet contrastive learning mechanism to align the single-modal or multi-modal information into a unified semantic space, where summary quality is enhanced by a better understanding of the document and side information. Results show that our model significantly surpasses strong baselines on three public single-modal or multi-modal benchmark summarization datasets.
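The abstract does not specify how the triplet contrastive objective is formulated. As a rough illustration only, the sketch below assumes the triplet consists of document, side-information, and summary embeddings, and that each pair is aligned with a standard InfoNCE loss over a batch. The function names, the choice of InfoNCE, the cosine similarity, and the temperature value are all assumptions made for this example, not details taken from the paper.

```python
import math


def cosine(u, v, eps=1e-8):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)  # eps guards against zero vectors


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    anchors[i] is pulled toward positives[i]; every positives[j] with j != i
    serves as an in-batch negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def triplet_contrastive_loss(doc, side, summ, temperature=0.1):
    """Hypothetical triplet objective: average of the three pairwise losses.

    Assumption: document, side information, and summary embeddings already
    live in vector spaces of the same dimension.
    """
    return (
        info_nce(doc, side, temperature)
        + info_nce(doc, summ, temperature)
        + info_nce(side, summ, temperature)
    ) / 3.0


# Toy batch of two examples with 2-dimensional embeddings.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]  # side information paired with the wrong document

loss_aligned = triplet_contrastive_loss(aligned, aligned, aligned)
loss_shuffled = triplet_contrastive_loss(aligned, shuffled, aligned)
print(loss_aligned < loss_shuffled)
```

In this toy setup the loss is lower when each document's side information and summary point in the same direction as the document itself, which is the behavior an alignment objective of this kind is meant to reward.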