ROME: Testing Image Captioning Systems via Recursive Object Melting

Image captioning (IC) systems aim to generate a text description of the salient objects in an image. In recent years, IC systems have been increasingly integrated into our daily lives, for example in assistive tools for visually impaired people and in description generation in Microsoft PowerPoint. However, even cutting-edge IC systems (e.g., Microsoft Azure Cognitive Services) and algorithms (e.g., OFA) can produce erroneous captions, leading to incorrect captioning of important objects, misunderstanding, and threats to personal safety. Existing testing approaches either fail to handle the complex form of IC system output (i.e., sentences in natural language) or generate unnatural images as test cases. To address these problems, we introduce Recursive Object MElting (Rome), a novel metamorphic testing approach for validating IC systems. Unlike existing approaches that generate test cases by inserting objects, which easily makes the generated images unnatural, Rome melts (i.e., removes and inpaints) objects. Rome assumes that the object set in the caption of an image includes the object set in the caption of the image generated from it by object melting. Given an image, Rome can recursively remove its objects to generate different pairs of images. We use Rome to test one widely adopted image captioning API and four state-of-the-art (SOTA) algorithms. The results show that the test cases generated by Rome look much more natural than those generated by the SOTA IC testing approach, and that they achieve naturalness comparable to the original images. Meanwhile, by generating test pairs from 226 seed images, Rome reports a total of 9,121 erroneous issues with high precision (86.47%-92.17%). In addition, we use the test cases generated by Rome to retrain Oscar, which improves its performance across multiple evaluation metrics.
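The metamorphic relation and the recursive pairing described above can be illustrated with a minimal sketch. It is not Rome's implementation: the keyword vocabulary `OBJECT_VOCAB`, the naive plural handling, and the helper names `object_set`, `unexpected_objects`, and `melt_pairs` are all assumptions made for illustration, and an image is modeled simply as a set of object labels, with "melting" reduced to removing one label.

```python
import re

# Hypothetical object vocabulary (an assumption for this sketch); a real
# pipeline would more likely rely on a POS tagger or a detector's class list.
OBJECT_VOCAB = {"dog", "cat", "frisbee", "person", "bench", "car", "ball"}


def object_set(caption, vocab=OBJECT_VOCAB):
    """Return the vocabulary objects mentioned in a caption."""
    found = set()
    for word in re.findall(r"[a-z]+", caption.lower()):
        if word in vocab:
            found.add(word)
        elif word.endswith("s") and word[:-1] in vocab:
            # Naive plural handling: "frisbees" -> "frisbee".
            found.add(word[:-1])
    return found


def unexpected_objects(ancestor_caption, descendant_caption):
    """Objects in the melted image's caption that the ancestor caption lacks.

    The relation holds when the result is empty, i.e. the descendant's
    object set is a subset of the ancestor's object set.
    """
    return object_set(descendant_caption) - object_set(ancestor_caption)


def melt_pairs(objects):
    """Enumerate (parent, child) object sets obtained by melting one object
    at a time, recursively, starting from the full set of objects."""
    pairs = set()

    def recurse(current):
        for obj in current:
            child = current - {obj}
            if (current, child) not in pairs:
                pairs.add((current, child))
                recurse(child)

    recurse(frozenset(objects))
    return pairs


if __name__ == "__main__":
    before = "a dog catching a frisbee on the grass"
    after = "a cat standing on the grass"  # caption after melting the frisbee
    print(unexpected_objects(before, after))
    print(len(melt_pairs({"dog", "frisbee"})))
```

In this sketch, a non-empty result from `unexpected_objects` flags a suspicious pair: the caption of the melted image mentions an object that the caption of its ancestor did not, so at least one of the two captions is likely wrong.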

