We propose to Transform Scene Graphs (TSG) into more descriptive captions. In TSG, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) that embeds scene graphs. After embedding, different graph embeddings carry knowledge specific to generating words with different parts of speech, e.g., object and attribute embeddings are well suited to generating nouns and adjectives, respectively. Motivated by this, we design a Mixture-of-Experts (MOE)-based decoder, where each expert is built on MHA, to discriminate among the graph embeddings when generating different kinds of words. Since both the encoder and the decoder are built on MHA, we obtain a homogeneous encoder-decoder, unlike previous heterogeneous ones, which usually pair a fully-connected-based GNN with an LSTM-based decoder. The homogeneous architecture lets us unify the training configuration of the whole model instead of specifying different training strategies for different sub-networks as in the heterogeneous pipeline, which eases training. Extensive experiments on the MS-COCO captioning benchmark validate the effectiveness of TSG. The code is available at: https://github.com/GaryJiajia/TSG.
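To make the MOE-based decoding idea concrete, the following is a minimal sketch of one decoding step, not the released TSG implementation. It assumes three experts (object, attribute, relation), uses single-head scaled dot-product attention in place of full MHA, and mixes the expert outputs with a soft gate computed from the decoder query. The names `attend`, `moe_step`, and `gate_vectors`, as well as the toy embeddings, are hypothetical and chosen only for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, nodes):
    """Single-head scaled dot-product attention of one query over node embeddings."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, n) / scale for n in nodes])
    dim = len(nodes[0])
    return [sum(w * n[i] for w, n in zip(weights, nodes)) for i in range(dim)]


def moe_step(query, graph_embeddings, gate_vectors):
    """Soft mixture of experts: each expert attends to one kind of graph embedding.

    Assumption: the gate is a softmax over query/gate-vector dot products;
    the actual routing used in TSG may differ.
    """
    kinds = sorted(graph_embeddings)
    expert_outputs = [attend(query, graph_embeddings[k]) for k in kinds]
    gate = softmax([dot(query, gate_vectors[k]) for k in kinds])
    dim = len(expert_outputs[0])
    mixed = [
        sum(g * out[i] for g, out in zip(gate, expert_outputs))
        for i in range(dim)
    ]
    return mixed, dict(zip(kinds, gate))


# Toy scene-graph embeddings (hypothetical values), grouped by node type.
graph_embeddings = {
    "object": [[1.0, 0.0, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0]],
    "attribute": [[0.0, 1.0, 0.0, 0.0]],
    "relation": [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
}
# Hypothetical per-expert gate vectors; in a trained model these would be learned.
gate_vectors = {
    "object": [1.0, 0.0, 0.0, 0.0],
    "attribute": [0.0, 1.0, 0.0, 0.0],
    "relation": [0.0, 0.0, 1.0, 0.0],
}
# A decoder query that resembles an object (noun-like) context.
query = [2.0, 0.1, 0.0, 0.0]

mixed, gate = moe_step(query, graph_embeddings, gate_vectors)
print(gate)
```

In this toy setting the query aligns most closely with the object gate vector, so the object expert receives the largest weight, mirroring the intuition that object embeddings are most useful when the next word is a noun.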