Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. Evaluated models must perform cross-modal causal and temporal reasoning and effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs spanning 8 primary domains. Evaluations of 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best model, Gemini 3 Flash, reaching only 64.8% accuracy. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and on popular audio-visual and video-only benchmarks demonstrate that OFF enhances both future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).
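For readers who want to inspect the benchmark directly, the sketch below loads the released dataset with the Hugging Face `datasets` library. The split name and field names used here are illustrative assumptions, not confirmed by the abstract; consult the dataset card at the link above for the actual schema.

```python
# Hypothetical sketch: loading FutureOmni from the Hugging Face Hub.
# Assumptions: the split name ("test") and the field names ("video",
# "question", "options", "answer") are placeholders for illustration;
# the real schema is documented on the dataset card at
# https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni.
from datasets import load_dataset

futureomni = load_dataset("OpenMOSS-Team/FutureOmni", split="test")

for sample in futureomni.select(range(3)):
    # Each sample is expected to pair an audio-visual clip with a
    # multiple-choice question about what happens next in the video.
    print(sample["video"], sample["question"])
    print("Options:", sample["options"], "| Answer:", sample["answer"])
```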