Past work has shown that paired vision-language signals substantially improve grammar induction on multimodal datasets such as MSCOCO. We investigate whether recent large language models (LLMs) trained only on text can provide strong assistance for grammar induction in multimodal settings. We find that our text-only approach, an LLM-based C-PCFG (LC-PCFG), outperforms previous multimodal methods and achieves state-of-the-art grammar induction performance on various multimodal datasets. Compared to image-aided grammar induction, LC-PCFG outperforms the prior state of the art by 7.9 Corpus-F1 points, with an 85% reduction in parameter count and 1.7x faster training. Across three video-assisted grammar induction benchmarks, LC-PCFG outperforms the prior state of the art by up to 7.7 Corpus-F1 points, with 8.8x faster training. These results suggest that text-only language models may encode visually grounded cues that aid grammar induction in multimodal contexts. Moreover, our results emphasize the importance of establishing a robust vision-free baseline when evaluating the benefit of multimodal approaches.