Scientific compound figures combine multiple labeled panels into a single image. However, in a PMC-scale crawl of 346,567 compound figures, 16.3% have no caption and 1.8% have captions shorter than ten words, so existing caption-decomposition pipelines discard them. We propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the image, converting otherwise unusable figures into aligned panel-text pairs for downstream pretraining and retrieval. To mitigate linguistic variance in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively controls how caption features condition the detection query space, and employ a staged SFT+RL strategy with CLIP-based alignment and BERTScore-based semantic rewards. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. FigEx2 achieves 0.728 mAP@0.5:0.95 for detection, outperforms Qwen3-VL-8B by 0.44 in METEOR and 0.22 in BERTScore, and transfers zero-shot to out-of-distribution scientific domains without fine-tuning.
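The abstract does not specify how the gated fusion module is built, so the following is only a minimal sketch of the general idea of gating caption features into detection queries, not FigEx2's actual design. It assumes a single scalar sigmoid gate per query and a residual update; the names `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_fusion(queries, caption_feat, gate_weights, gate_bias):
    """Fuse one caption feature vector into each detection query.

    Assumed form (illustrative, not taken from the paper):
        g  = sigmoid(w . [q ; c] + b)   # scalar gate per query
        q' = q + g * c                  # residual, gated update

    queries:      list of query vectors, each of length d
    caption_feat: caption feature vector of length d
    gate_weights: vector of length 2*d applied to [q ; c]
    gate_bias:    scalar; strongly negative values close the gate
    """
    fused = []
    for q in queries:
        joint = list(q) + list(caption_feat)
        logit = sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias
        g = sigmoid(logit)
        fused.append([qi + g * ci for qi, ci in zip(q, caption_feat)])
    return fused


if __name__ == "__main__":
    queries = [[1.0, 2.0], [0.0, -1.0]]
    caption = [0.5, 0.5]
    zero_w = [0.0, 0.0, 0.0, 0.0]
    # With zero weights and zero bias the gate is exactly 0.5 for every query.
    print(gated_fusion(queries, caption, zero_w, 0.0))
    # A strongly negative bias closes the gate, leaving queries almost unchanged.
    print(gated_fusion(queries, caption, zero_w, -50.0))
```

In this sketch, a gate near zero means the caption features barely affect the queries, which is one simple way a model could down-weight noisy caption text. A learned module would train `gate_weights` and `gate_bias` rather than fix them by hand.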