Image captioning (IC) systems, such as Microsoft Azure Cognitive Service, translate image content into descriptive language but can generate inaccurate captions that lead to misinterpretation. State-of-the-art testing techniques such as MetaIC and ROME aim to expose these errors but face significant challenges. They require intensive manual labor for detailed annotations and often produce unrealistic images, either by adding unrelated objects or by failing to remove existing ones. They also generate limited test suites: MetaIC is restricted to inserting specific objects, and ROME to a narrow range of variations. We introduce SPOLRE, a novel automated tool for semantic-preserving object layout reconstruction in IC system testing. SPOLRE leverages four transformation techniques to modify object layouts without altering the image's semantics. This automated approach eliminates the need for manual annotations and creates realistic, varied test suites. In our survey, over 75% of respondents find SPOLRE-generated images more realistic than those from state-of-the-art methods. SPOLRE is also more effective at identifying caption errors: it detects 31,544 incorrect captions across seven IC systems with an average precision of 91.62%, whereas other methods average 85.65% and identify 17,160 incorrect captions. Notably, SPOLRE identified 6,236 unique issues in Azure, demonstrating its effectiveness against one of the most advanced IC systems.