The ability to extract compact, meaningful summaries from large-scale multimodal data is critical for numerous applications, ranging from video analytics to medical reporting. Prior methods for cross-modal summarization often suffer from high computational overhead and limited interpretability. In this paper, we propose a \textit{Cross-Modal State-Space Graph Reasoning} (\textbf{CSS-GR}) framework that couples a state-space model with graph-based message passing, inspired by prior work on efficient state-space models. Unlike existing approaches that rely on purely sequential models, our method constructs a graph capturing both inter- and intra-modal relationships, enabling more holistic reasoning over textual and visual streams. Experiments on standard multimodal summarization benchmarks show that our approach significantly improves summarization quality and interpretability while maintaining computational efficiency. We also provide a thorough ablation study that quantifies the contribution of each component.
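To make the interplay of the two components concrete, a minimal sketch follows; the discretized state-space parameters $\bar{A}, \bar{B}, C$, the message-passing weights $W_{\mathrm{self}}, W_{\mathrm{msg}}$, and the neighborhood $\mathcal{N}(v)$ are illustrative notation, not the exact parameterization used in CSS-GR. Each modality stream is first encoded by a linear state-space recurrence,
\begin{equation}
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t,
\end{equation}
and the resulting token states become nodes of a cross-modal graph, over which one round of message passing takes the form
\begin{equation}
z_v^{(\ell+1)} = \sigma\!\Big( W_{\mathrm{self}}\, z_v^{(\ell)} + \sum_{u \in \mathcal{N}(v)} W_{\mathrm{msg}}\, z_u^{(\ell)} \Big),
\end{equation}
where $\mathcal{N}(v)$ mixes intra-modal (within a stream) and inter-modal (text-to-visual) edges.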