Controllable image captioning is an emerging multimodal topic that aims to describe an image in natural language according to human intent, e.g., by attending to specified regions or writing in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability in interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation-model-augmented image captioning framework that supports a wide range of multimodal controls: 1) visual controls, including points, boxes, and trajectories; 2) language controls, such as sentiment, length, language, and factuality. Building on the Segment Anything Model (SAM) and ChatGPT, we unify visual and language prompts in a modularized framework, enabling flexible combination of different controls. Extensive case studies demonstrate the framework's ability to align with user intent, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything.
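To make the modular design described in the abstract more concrete, the following is a minimal sketch of how visual and language controls might be combined in one pipeline. It is not the authors' implementation: the names `VisualControl`, `LanguageControl`, `build_refine_prompt`, and `caption_region` are hypothetical, the prompt wording is invented for illustration, and the segmenter, captioner, and refiner are passed in as plain callables rather than real SAM or ChatGPT calls.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

Point = Tuple[int, int]


@dataclass
class VisualControl:
    """Hypothetical region hint; `kind` is 'point', 'box', or 'trajectory'."""
    kind: str
    coords: List[Point]


@dataclass
class LanguageControl:
    """Hypothetical container for the language controls named in the abstract."""
    sentiment: Optional[str] = None
    length: Optional[int] = None   # target length in words (assumed unit)
    language: Optional[str] = None
    factual: bool = True


def build_refine_prompt(raw_caption: str, ctrl: LanguageControl) -> str:
    """Compose a plain-text instruction for a text refiner (wording is illustrative)."""
    parts = [f"Rewrite this caption: '{raw_caption}'."]
    if ctrl.sentiment:
        parts.append(f"Use a {ctrl.sentiment} tone.")
    if ctrl.length:
        parts.append(f"Use about {ctrl.length} words.")
    if ctrl.language:
        parts.append(f"Answer in {ctrl.language}.")
    parts.append("Stick to the facts." if ctrl.factual else "You may be imaginative.")
    return " ".join(parts)


def caption_region(
    image: Any,
    visual: VisualControl,
    language: LanguageControl,
    segmenter: Callable[[Any, VisualControl], Any],
    captioner: Callable[[Any, Any], str],
    refiner: Callable[[str], str],
) -> str:
    """Run the three assumed stages: segment, caption the region, refine the text."""
    mask = segmenter(image, visual)   # stand-in for a SAM-like segmenter
    raw = captioner(image, mask)      # stand-in for any region captioner
    return refiner(build_refine_prompt(raw, language))  # stand-in for an LLM call
```

Because each stage is just a callable, any visual control can be paired with any language control, which is the kind of flexible combination the abstract describes. In practice the three callables would wrap real models; here they are left to the caller so the sketch stays self-contained.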