The rapid growth of streaming video applications demands multimodal models with enhanced capabilities for temporal dynamics understanding and complex reasoning. However, current Video Question Answering (VideoQA) datasets suffer from two critical limitations: 1) static annotation mechanisms fail to capture the evolving nature of answers in temporal video streams, and 2) the absence of explicit reasoning-process annotations restricts model interpretability and logical deduction capabilities. To address these challenges, we introduce StreamingCoT, the first dataset explicitly designed for temporally evolving reasoning in streaming VideoQA and multimodal Chain-of-Thought (CoT) tasks. Our framework first establishes a dynamic hierarchical annotation architecture that generates per-second dense descriptions and constructs temporally dependent semantic segments through similarity fusion, paired with question-answer sets constrained by temporal evolution patterns. We further propose an explicit reasoning-chain generation paradigm that extracts spatiotemporal objects via keyframe semantic alignment, derives reasoning paths grounded in object state transitions using large language models, and ensures logical coherence through human verification. This dataset establishes a foundation for advancing research in streaming video understanding, complex temporal reasoning, and multimodal inference. StreamingCoT and its construction toolkit are available at https://github.com/Fleeting-hyh/StreamingCoT.
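To make the similarity-fusion step concrete, below is a minimal sketch of how per-second descriptions might be merged into temporally dependent segments, assuming each description has already been encoded as an embedding vector. The `fuse_segments` helper and `merge_threshold` parameter are illustrative stand-ins, not part of the released toolkit.

```python
import numpy as np

def fuse_segments(embeddings: np.ndarray, merge_threshold: float = 0.85) -> list[tuple[int, int]]:
    """Greedily merge consecutive per-second description embeddings into
    segments whenever the cosine similarity between the running segment
    centroid and the next second's embedding exceeds the threshold.

    Returns a list of (start_second, end_second) pairs, end-exclusive.
    """
    # Normalize rows so dot products equal cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    segments = []
    start = 0
    centroid = normed[0]
    for t in range(1, len(normed)):
        if float(centroid @ normed[t]) >= merge_threshold:
            # Same segment: update and re-normalize the running centroid.
            centroid = normed[start : t + 1].mean(axis=0)
            centroid /= np.linalg.norm(centroid)
        else:
            # Similarity dropped: close the current segment, open a new one.
            segments.append((start, t))
            start, centroid = t, normed[t]
    segments.append((start, len(normed)))
    return segments

# Toy usage: 6 per-second embeddings where seconds 0-2 resemble each other
# and seconds 3-5 resemble each other, so two segments should emerge.
rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)
embs = np.stack([a, a + 0.01 * rng.normal(size=8), a, b, b, b])
print(fuse_segments(embs))  # expected: [(0, 3), (3, 6)]
```

The greedy centroid update keeps the segment boundary decision online, which matches the streaming setting: each new second is compared only against the segment built so far, never against future frames.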