This short position paper provides a manually curated list of non-English image captioning datasets (as of May 2024). The list reveals a dearth of datasets across languages: only 23 different languages are represented. Adding the Crossmodal-3600 dataset (Thapliyal et al., 2022; 36 languages) increases this number somewhat, but it remains small compared to the roughly 500 institutional languages in existence. The paper closes with some open questions for the field of Vision & Language.