Large language models (LLMs) and large multimodal models (LMMs) have significantly impacted the AI community, industry, and various economic sectors. In journalism, integrating AI poses unique challenges and opportunities, particularly for improving the quality and efficiency of news reporting. This study explores how LLMs and LMMs can assist journalistic practice by generating contextualised captions for images accompanying news articles. We conducted experiments on the GoodNews dataset to evaluate the ability of LMMs (BLIP-2, GPT-4v, or LLaVA) to incorporate one of two types of context: entire news articles, or extracted named entities. We also compared their performance to a two-stage pipeline composed of a captioning model (BLIP-2, OFA, or ViT-GPT2) followed by post-hoc contextualisation with an LLM (GPT-4 or LLaMA). Across this range of models, we found that the choice of contextualisation model is a significant factor for the two-stage pipelines but not for the LMMs, where smaller, open-source models perform well compared to proprietary, GPT-powered ones. We also found that controlling the amount of provided context enhances performance. These results highlight the limitations of a fully automated approach and underscore the necessity for an interactive, human-in-the-loop strategy.