Controllable image captioning is an emerging multimodal topic that aims to describe an image in natural language according to user intent, e.g., by attending to specified regions or writing in a particular text style. State-of-the-art methods are trained on annotated pairs of input controls and output captions. However, the scarcity of such well-annotated multimodal data largely limits their usability and scalability in interactive AI systems. Leveraging unimodal instruction-following foundation models is a promising alternative that benefits from broader sources of data. In this paper, we present Caption AnyThing (CAT), a foundation-model-augmented image captioning framework that supports a wide range of multimodal controls: 1) visual controls, including points, boxes, and trajectories; and 2) language controls, such as sentiment, length, language, and factuality. Building on the Segment Anything Model (SAM) and ChatGPT, we unify visual and language prompts in a modularized framework, enabling flexible combination of different controls. Extensive case studies demonstrate the framework's ability to align with user intent, shedding light on effective user interaction modeling in vision-language applications. Our code is publicly available at https://github.com/ttengwang/Caption-Anything.
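The modular design described above, in which a visual control selects a region, a captioner describes it, and language controls shape the final text, can be illustrated with a minimal sketch. Every name here (`PointControl`, `LanguageControl`, `caption_anything`, the `stub_*` functions) is a hypothetical placeholder chosen for illustration and is not the API of the released code. The three stages are plain callables so that a segmenter such as SAM, any captioning model, and an instruction-following language model such as ChatGPT could be plugged in; the stubs only stand in for those models and make no real model calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Union


# --- Visual controls (hypothetical types, for illustration only) ---
@dataclass(frozen=True)
class PointControl:
    x: int
    y: int


@dataclass(frozen=True)
class BoxControl:
    x0: int
    y0: int
    x1: int
    y1: int


@dataclass(frozen=True)
class TrajectoryControl:
    points: Tuple[PointControl, ...]


VisualControl = Union[PointControl, BoxControl, TrajectoryControl]


# --- Language controls (fields mirror the four controls named in the abstract) ---
@dataclass(frozen=True)
class LanguageControl:
    sentiment: str = "neutral"
    length: int = 20  # target caption length in words
    language: str = "English"
    factual: bool = True


Region = Tuple[int, int, int, int]  # bounding box of the selected region

# Each stage is a plain callable, so stages can be swapped independently.
Segmenter = Callable[[object, VisualControl], Region]
Captioner = Callable[[object, Region], str]
Refiner = Callable[[str, LanguageControl], str]


def build_refine_prompt(raw_caption: str, control: LanguageControl) -> str:
    """Turn language controls into an instruction for a text-only model."""
    mode = "Stay strictly factual." if control.factual else "You may embellish."
    return (
        f"Rewrite the caption in {control.language} with a {control.sentiment} "
        f"tone, using about {control.length} words. {mode}\n"
        f"Caption: {raw_caption}"
    )


def caption_anything(image, visual, language, segmenter, captioner, refiner) -> str:
    """Chain the three stages: segment, caption, then refine."""
    region = segmenter(image, visual)
    raw_caption = captioner(image, region)
    return refiner(raw_caption, language)


# --- Stubs standing in for real models (assumptions, not real model calls) ---
def stub_segmenter(image, control: VisualControl) -> Region:
    if isinstance(control, PointControl):
        return (control.x - 5, control.y - 5, control.x + 5, control.y + 5)
    if isinstance(control, BoxControl):
        return (control.x0, control.y0, control.x1, control.y1)
    xs = [p.x for p in control.points]
    ys = [p.y for p in control.points]
    return (min(xs), min(ys), max(xs), max(ys))


def stub_captioner(image, region: Region) -> str:
    return f"an object inside region {region}"


def stub_refiner(raw_caption: str, control: LanguageControl) -> str:
    # A real refiner would send build_refine_prompt(...) to a language model;
    # this stub only enforces the length control by truncating words.
    return " ".join(raw_caption.split()[: control.length])


if __name__ == "__main__":
    print(
        caption_anything(
            None,
            PointControl(10, 10),
            LanguageControl(length=3),
            stub_segmenter,
            stub_captioner,
            stub_refiner,
        )
    )
```

Because the visual and language controls enter at different stages, any visual control can be combined with any language control without retraining, which is the flexibility the abstract attributes to the modularized framework.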