We compare the zero-shot performance of a neural caption-based image retriever when it is given either human-produced captions or captions generated by a neural captioner. We conduct this comparison on the recently introduced ImageCoDe dataset (Krojer et al., 2022), which contains hard distractors nearly identical to the images to be retrieved. We find that the neural retriever performs much better when fed neural rather than human captions, even though the former, unlike the latter, were generated without awareness of the distractors that make the task hard. Even more remarkably, when the same neural captions are given to human subjects, their retrieval performance is almost at chance level. Our results thus add to the growing body of evidence that, even when the "language" of neural models resembles English, this superficial resemblance might be deeply misleading.