We propose Transforming Scene Graphs (TSG), a method that transforms scene graphs into more descriptive captions. In TSG, we use multi-head attention (MHA) to build the Graph Neural Network (GNN) that embeds scene graphs. After embedding, different graph embeddings carry knowledge specific to generating words with different parts of speech; for example, object and attribute embeddings are well suited to generating nouns and adjectives, respectively. Motivated by this, we design a Mixture-of-Experts (MoE)-based decoder, in which each expert is built on MHA, to discriminate among the graph embeddings when generating different kinds of words. Because both the encoder and the decoder are built on MHA, the resulting encoder-decoder is homogeneous, unlike previous heterogeneous ones, which usually pair a fully connected GNN with an LSTM-based decoder. The homogeneous architecture lets us use a single training configuration for the whole model instead of specifying different training strategies for different sub-networks, as the heterogeneous pipeline requires, which eases training. Extensive experiments on the MS-COCO captioning benchmark validate the effectiveness of TSG. The code is available at https://anonymous.4open.science/r/ACL23_TSG.
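The idea of a Mixture-of-Experts decoder whose experts attend to different graph embeddings can be illustrated with a small sketch. This is not the TSG implementation: it assumes single-head scaled dot-product attention in place of full MHA, one expert each for hypothetical object, attribute, and relation embeddings, and a soft router that mixes the expert outputs with softmax gates. All names (`attend`, `moe_decode_step`, `router_weights`) and all numbers are illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, nodes):
    """Single-head scaled dot-product attention (a simplification of MHA).

    The node embeddings serve as both keys and values.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, node)) / scale for node in nodes]
    weights = softmax(scores)
    dim = len(nodes[0])
    return [sum(w * node[i] for w, node in zip(weights, nodes)) for i in range(dim)]


def moe_decode_step(query, expert_inputs, router_weights):
    """One decoding step: each expert attends to its own embedding type,
    then a soft router (assumed here to be a linear map of the query)
    mixes the expert outputs.
    """
    names = list(expert_inputs)
    expert_outs = [attend(query, expert_inputs[n]) for n in names]
    logits = [sum(q * w for q, w in zip(query, router_weights[n])) for n in names]
    gates = softmax(logits)
    dim = len(expert_outs[0])
    mixed = [sum(g * out[i] for g, out in zip(gates, expert_outs)) for i in range(dim)]
    return mixed, dict(zip(names, gates))


# Toy 2-dimensional embeddings; the values are made up for illustration.
graph = {
    "object": [[1.0, 0.0], [0.0, 1.0]],
    "attribute": [[0.5, 0.5]],
    "relation": [[-1.0, 0.0], [0.0, -1.0]],
}
router = {
    "object": [1.0, 0.0],
    "attribute": [0.0, 1.0],
    "relation": [-1.0, -1.0],
}
context, gates = moe_decode_step([2.0, 0.0], graph, router)
```

With this toy query, the router logit is highest for the object expert, so its output dominates the mixed context vector. In a trained model the gates would instead be learned, so that, for instance, the object expert is favored when the next word is a noun.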