We propose to Transform Scene Graphs (TSG) into more descriptive captions. In TSG, we apply multi-head attention (MHA) to design the Graph Neural Network (GNN) that embeds scene graphs. After embedding, the different graph embeddings carry distinct knowledge for generating words with different parts of speech; for example, object and attribute embeddings are well suited to generating nouns and adjectives, respectively. Motivated by this, we design a Mixture-of-Experts (MOE)-based decoder, in which each expert is built on MHA, to discriminate among the graph embeddings when generating different kinds of words. Because both the encoder and the decoder are built on MHA, we obtain a homogeneous encoder-decoder, unlike previous heterogeneous ones, which usually combine a fully-connected-based GNN with an LSTM-based decoder. The homogeneous architecture lets us unify the training configuration of the whole model instead of specifying different training strategies for different sub-networks, as in the heterogeneous pipeline, which eases training. Extensive experiments on the MS-COCO captioning benchmark validate the effectiveness of TSG. The code is available at https://anonymous.4open.science/r/ACL23_TSG.
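To make the decoder idea concrete, the following is a minimal sketch of one decoding step in which each expert attends over one type of graph embedding and a soft router mixes the expert outputs. It is not the released TSG implementation. Several things here are assumptions made for illustration: the function names (`attend`, `moe_decode_step`), the `router_vectors` parameter, the three node types, the dot-product soft routing rule, and the use of single-head attention without learned projections in place of full MHA.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(query, nodes):
    """Single-head scaled dot-product attention over node embeddings.

    Simplification (assumption): keys and values are the raw node
    embeddings, with no learned projections and only one head.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, n) / scale for n in nodes])
    dim = len(nodes[0])
    return [sum(w * n[i] for w, n in zip(weights, nodes)) for i in range(dim)]


def moe_decode_step(query, graph_embeddings, router_vectors):
    """One decoding step of a hypothetical MOE-style decoder.

    Each expert attends over one type of graph embedding; a soft router
    (assumed here to be a softmax over query-router dot products) mixes
    the expert outputs into a single context vector.
    """
    names = sorted(graph_embeddings)
    expert_outputs = {n: attend(query, graph_embeddings[n]) for n in names}
    gate = softmax([dot(query, router_vectors[n]) for n in names])
    dim = len(query)
    mixed = [
        sum(g * expert_outputs[n][i] for g, n in zip(gate, names))
        for i in range(dim)
    ]
    return mixed, dict(zip(names, gate))


# Toy 4-dimensional embeddings, hand-picked for illustration only.
graph_embeddings = {
    "object": [[1.0, 0.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]],
    "attribute": [[0.0, 1.0, 0.0, 0.0]],
    "relation": [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
}
router_vectors = {
    "object": [1.0, 0.0, 0.0, 0.0],
    "attribute": [0.0, 1.0, 0.0, 0.0],
    "relation": [0.0, 0.0, 1.0, 0.0],
}

# A decoder state aligned with the "object" direction, as might occur
# when the next word is a noun.
query = [2.0, 0.0, 0.0, 0.0]
mixed, gate = moe_decode_step(query, graph_embeddings, router_vectors)
```

In this toy setup the query is aligned with the object router vector, so the gate gives the object expert the largest weight. A trained model would learn the router and the attention projections rather than use fixed vectors.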