In this work we present an approach for generating alternative text (alt-text) descriptions for images shared on social media, specifically Twitter. More than just a special case of image captioning, alt-text is both more literally descriptive and more context-specific. Critically, images posted to Twitter are often accompanied by user-written text that, although it does not necessarily describe the image, may provide useful context if properly leveraged. We address this task with a multimodal model that conditions on both textual information from the associated social media post and the visual signal from the image, and we demonstrate that the benefits of these two information sources stack. We introduce a new dataset of 371k images paired with alt-text and tweets scraped from Twitter, and evaluate on it with a variety of automated metrics as well as human evaluation. We show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work, by more than 2x on BLEU@4.