Scientific compound figures combine multiple labeled panels into a single image, but captions in real pipelines are often missing or provide only figure-level summaries, making panel-level understanding difficult. In this paper, we propose FigEx2, a visual-conditioned framework that localizes panels and generates panel-wise captions directly from the compound figure. To mitigate the impact of diverse phrasing in open-ended captioning, we introduce a noise-aware gated fusion module that adaptively filters token-level features to stabilize the detection query space. Furthermore, we employ a staged optimization strategy that combines supervised learning with reinforcement learning (RL), using CLIP-based alignment and BERTScore-based semantic rewards to enforce strict multimodal consistency. To support high-quality supervision, we curate BioSci-Fig-Cap, a refined benchmark for panel-level grounding, alongside cross-disciplinary test suites in physics and chemistry. Experimental results demonstrate that FigEx2 achieves 0.726 mAP@0.5:0.95 on panel detection and outperforms Qwen3-VL-8B by 0.51 in METEOR and 0.24 in BERTScore. Notably, FigEx2 exhibits strong zero-shot transferability to out-of-distribution scientific domains without any fine-tuning.
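The noise-aware gated fusion described above can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual module: the class name, dimensions, and the choice of conditioning the gate on pooled visual context are all assumptions. The core idea shown is a learned sigmoid gate that scores each caption token and elementwise down-weights noisy tokens before they condition the detection queries.

```python
import torch
import torch.nn as nn

class NoiseAwareGatedFusion(nn.Module):
    """Illustrative sketch (not the paper's exact design): a sigmoid gate,
    conditioned on each text token and a pooled visual context, suppresses
    noisy token-level features before they enter the detection query space."""

    def __init__(self, dim: int):
        super().__init__()
        # Gate sees the token concatenated with pooled visual context (assumption)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_tokens: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # text_tokens:  (B, T, D) token-level caption features
        # visual_feats: (B, N, D) patch-level visual features
        ctx = visual_feats.mean(dim=1, keepdim=True).expand(-1, text_tokens.size(1), -1)
        g = self.gate(torch.cat([text_tokens, ctx], dim=-1))  # (B, T, D), values in [0, 1]
        # Elementwise gating: tokens judged noisy are attenuated toward zero
        return self.proj(g * text_tokens)

# Example usage
fusion = NoiseAwareGatedFusion(dim=64)
fused = fusion(torch.randn(2, 5, 64), torch.randn(2, 10, 64))  # (2, 5, 64)
```

The gate keeps the fused representation the same shape as the text tokens, so it can slot in front of a DETR-style query-generation stage without changing downstream interfaces.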