We introduce DanmakuTPPBench, a comprehensive benchmark designed to advance multi-modal Temporal Point Process (TPP) modeling in the era of Large Language Models (LLMs). While TPPs have been widely studied for modeling temporal event sequences, existing datasets are predominantly unimodal, hindering the development of models that jointly reason over temporal, textual, and visual information. To address this gap, DanmakuTPPBench comprises two complementary components: (1) DanmakuTPP-Events, a novel dataset derived from the Bilibili video platform, where user-generated bullet comments (Danmaku) naturally form multi-modal events annotated with precise timestamps, rich textual content, and corresponding video frames; and (2) DanmakuTPP-QA, a challenging question-answering dataset constructed via a multi-agent pipeline powered by state-of-the-art LLMs and multi-modal LLMs (MLLMs), targeting complex temporal-textual-visual reasoning. We conduct extensive evaluations with both classical TPP models and recent MLLMs, revealing significant performance gaps and exposing the limitations of current methods in modeling multi-modal event dynamics. Our benchmark establishes strong baselines and calls for deeper integration of TPP modeling into the multi-modal language modeling landscape. Project page: https://github.com/FRENKIE-CHIANG/DanmakuTPPBench
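To make the event structure concrete: classical TPPs model a sequence of timestamps through a conditional intensity (the instantaneous event rate given the history), while each DanmakuTPP-Events event additionally carries text and a video frame. The sketch below is a minimal, hypothetical illustration of such a record in Python; the field names (`timestamp`, `text`, `frame_path`) and the `inter_event_times` helper are assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DanmakuEvent:
    """One bullet-comment event: a timestamp paired with text and a frame.

    Field names are hypothetical; see the project page for the real schema.
    """
    timestamp: float  # seconds from the start of the video
    text: str         # the Danmaku comment content
    frame_path: str   # path to the video frame nearest the timestamp

def inter_event_times(events: List[DanmakuEvent]) -> List[float]:
    """Inter-event gaps, the quantity classical TPP models typically fit."""
    times = sorted(e.timestamp for e in events)
    return [t2 - t1 for t1, t2 in zip(times, times[1:])]

# Illustrative usage with made-up values:
seq = [
    DanmakuEvent(12.5, "lol this part", "frames/000312.jpg"),
    DanmakuEvent(13.0, "here it comes!", "frames/000325.jpg"),
    DanmakuEvent(47.0, "replay at 0:45", "frames/001175.jpg"),
]
print(inter_event_times(seq))  # [0.5, 34.0]
```

A burst of small gaps like this, aligned with on-screen content, is exactly the kind of multi-modal event dynamic the benchmark asks models to capture.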