Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories--capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations in time, and Chapter Summaries that compose these observations into concise, story-centric narratives. Rather than prompting for chapters directly, we adopt a staged construction pipeline that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We further observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To reduce input tokens and enable efficient training, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built on these designs, our HiVid-Narrator framework achieves superior narrative quality with fewer input tokens than existing methods.
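To make the dual-granularity annotation concrete, the sketch below shows one plausible per-video record: event-level Temporal Chain-of-Thought entries and the chapter layer composed from them. All class and field names here (CoTEvent, Chapter, EHVCAnnotation, etc.) are illustrative assumptions, not the released E-HVC schema.

```python
# Hypothetical sketch of a dual-granularity, temporally grounded annotation.
# Field names are assumptions for illustration; they do not reflect the
# actual E-HVC release format.
from dataclasses import dataclass
from typing import List


@dataclass
class CoTEvent:
    start_s: float        # event start time in seconds
    end_s: float          # event end time in seconds
    observation: str      # fine-grained, fact-grounded event-level description


@dataclass
class Chapter:
    start_s: float
    end_s: float
    title: str            # concise, story-centric chapter title
    summary: str          # composed from the events the chapter spans
    event_ids: List[int]  # indices into the Temporal Chain-of-Thought list


@dataclass
class EHVCAnnotation:
    video_id: str
    asr_transcript: str         # curated ASR used as linguistic evidence
    cot_events: List[CoTEvent]  # event-level Temporal Chain-of-Thought
    chapters: List[Chapter]     # chapter-level summaries built on the events
```

Under this reading, chapter boundaries and titles are refined conditioned on the event list rather than predicted directly from raw frames, which is how the staged construction keeps the summaries time-aligned and grounded in observed evidence.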