Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained controllability and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models and incorporates user instructions together with modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300\,k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy from stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45\% and style scores by 12\%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.