Vision-language models (VLMs) are achieving increasingly strong performance on multimodal tasks. However, reasoning capabilities remain limited, particularly for smaller VLMs, while those of large language models (LLMs) have seen numerous improvements. We propose a technique to transfer capabilities from LLMs to VLMs. On the recently introduced ChartQA, our method obtains state-of-the-art performance when applied to the PaLI3-5B VLM by \citet{chen2023pali3}, while also enabling much better performance on PlotQA and FigureQA. We first improve the chart representation by continuing the pre-training stage using an improved version of the chart-to-table translation task by \citet{liu2023deplot}. We then propose constructing a dataset 20x larger than the original training set. To improve general reasoning capabilities and numerical operations, we synthesize reasoning traces using the table representation of charts. Lastly, our model is fine-tuned using the multitask loss introduced by \citet{hsieh2023distilling}. Our variant ChartPaLI-5B outperforms even 10x larger models such as PaLIX-55B without using an upstream OCR system, while keeping inference time constant relative to the PaLI3-5B baseline. When rationales are further refined with a simple program-of-thought prompt \cite{chen2023program}, our model outperforms the recently introduced Gemini Ultra and GPT-4V.