Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that capture multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without any additional fine-tuning of video-text models or additional datasets, we demonstrate that available LLMs can use these multimodal textual descriptions as proxies for ``sight'' or ``hearing'' and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 and Kinetics, show that these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
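
The in-context classification step described above can be illustrated with a minimal sketch. It assumes the per-modality texts (frame captions, a speech transcript, audio labels) have already been produced upstream by models such as BLIP-2, Whisper and ImageBind. The function names, the prompt wording and the answer-parsing rule are all hypothetical choices made for illustration, not the prompt or procedure used in this work. The LLM is passed in as a plain callable, so the sketch stays model-agnostic.

```python
from typing import Callable, Optional, Sequence


def build_prompt(
    frame_captions: Sequence[str],
    transcript: str,
    audio_labels: Sequence[str],
    class_names: Sequence[str],
) -> str:
    """Assemble the per-modality text into one classification prompt.

    The wording below is an assumption made for illustration.
    """
    captions = "\n".join(f"- {c}" for c in frame_captions) or "- (none)"
    speech = transcript.strip() or "(no speech detected)"
    sounds = ", ".join(audio_labels) or "(none)"
    labels = ", ".join(class_names)
    return (
        "You are given text descriptions of a video.\n"
        f"Frame captions:\n{captions}\n"
        f"Speech transcript: {speech}\n"
        f"Sounds: {sounds}\n"
        f"Choose exactly one label from: {labels}\n"
        "Answer with the label only."
    )


def parse_label(response: str, class_names: Sequence[str]) -> Optional[str]:
    """Return the longest class name contained in the response, or None.

    Preferring the longest match avoids picking "Surfing" when the
    model answered "WindSurfing".
    """
    text = response.lower()
    hits = [name for name in class_names if name.lower() in text]
    return max(hits, key=len) if hits else None


def classify_video(
    frame_captions: Sequence[str],
    transcript: str,
    audio_labels: Sequence[str],
    class_names: Sequence[str],
    llm: Callable[[str], str],
) -> Optional[str]:
    """Zero-shot classification: prompt the LLM, then map its reply to a label."""
    prompt = build_prompt(frame_captions, transcript, audio_labels, class_names)
    return parse_label(llm(prompt), class_names)
```

Any text-completion backend can be plugged in as `llm`, for example a thin wrapper around a hosted or local model that takes a prompt string and returns the generated text. Returning `None` when no label is recognised leaves the handling of unparseable answers to the caller.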