Multimodal video-to-text models have made considerable progress, primarily in generating brief descriptions of video content. However, they still fall short of producing rich, long-form text that integrates both video and audio. In this paper, we introduce M2S, a framework designed to generate novel-length text by combining audio, video, and character recognition. M2S comprises modules for long-form video description and comprehension; audio-based analysis of emotion, speech rate, and character alignment; and vision-based character recognition and alignment. By integrating multimodal information with the large language model GPT-4o, M2S stands out in the field of multimodal text generation. We demonstrate the effectiveness and accuracy of M2S through comparative experiments and human evaluation. Moreover, the framework is readily extensible and holds significant potential for future research.
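As a rough illustration of the integration step described above, the sketch below shows how per-modality analysis results could be merged into a single GPT-4o prompt using the OpenAI Python SDK. The module outputs, prompt wording, and variable names are hypothetical assumptions for illustration; the paper does not specify this interface.

```python
# Illustrative sketch only: the module outputs, prompt wording, and variable
# names below are hypothetical; M2S's actual integration interface may differ.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical per-modality analysis results, as M2S's modules might produce.
video_description = "A woman enters a dim kitchen and pauses by the window."
audio_analysis = {"emotion": "melancholy", "speech_rate": "slow"}
character_alignment = {"speaker": "Anna", "on_screen": ["Anna"]}

# Merge the modality-specific signals into one long-form generation prompt.
prompt = (
    "Write a novel-style passage from the following scene analysis.\n"
    f"Visual description: {video_description}\n"
    f"Audio emotion: {audio_analysis['emotion']}, "
    f"speech rate: {audio_analysis['speech_rate']}\n"
    f"Speaker/character alignment: {character_alignment}\n"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

In practice, a pipeline of this kind would run once per scene or shot and concatenate the generated passages, which is one way the modular design could scale to novel-length output.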