Leum-VL Technical Report

A short video succeeds not simply because of what it shows, but because of how it schedules attention, yet current multimodal models lack the structural grammar to parse or produce this organization. Existing models can describe scenes, answer event-centric questions, and read on-screen text, but they are far less reliable at identifying timeline-grounded units such as hooks, cut rationales, shot-induced tension, and platform-facing packaging cues. Inspired by professional storyboard practice in film and television production, we propose SV6D (Structured Video in Six Dimensions), a representation framework that decomposes internet-native video into six complementary structural dimensions (subject, aesthetics, camera language, editing, narrative, and dissemination), with each label tied to physically observable evidence on the timeline. We formalize a unified optimization objective over SV6D that combines Hungarian-matched temporal alignment, dimension-wise semantic label distance, and quality regularization. Building on this framework, we present Leum-VL-8B, an 8B-parameter video-language model that realizes the SV6D objective through an expert-driven post-training pipeline and is further refined through verifiable reinforcement learning on perception-oriented tasks. Leum-VL-8B achieves 70.8 on VideoMME (w/o subtitles), 70.0 on MVBench, and 61.6 on MotionBench, while remaining competitive on general multimodal evaluations such as MMBench-EN. We also construct FeedBench, a benchmark for structure-sensitive short-video understanding. Our results indicate that the missing layer in video AI is not pixel generation but structural representation: grounded on the timeline, linked to visible evidence, and directly consumable by downstream workflows such as editing, retrieval, recommendation, and generation control. This representation also extends to text-heavy internet video formats with overlays and image-text layouts.
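To make the shape of such an objective concrete, the following is a minimal sketch of a matched cost between predicted and reference segments. It is an illustration under stated assumptions, not the formulation used for Leum-VL-8B. The segment format, the function names, the weights, the exact-match label distance, and the use of an unmatched-segment penalty as the regularization term are all hypothetical. The matching is found by exhaustive search, which reaches the same optimum as the Hungarian algorithm on small inputs but does not scale.

```python
from itertools import permutations

# The six SV6D dimensions named in the abstract; the key spellings are assumptions.
DIMENSIONS = ("subject", "aesthetics", "camera_language",
              "editing", "narrative", "dissemination")


def temporal_iou(a, b):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def label_distance(pred_labels, ref_labels):
    """Fraction of dimensions whose labels differ.

    Exact-match comparison is a toy stand-in for a semantic label distance.
    """
    mismatches = sum(pred_labels.get(d) != ref_labels.get(d) for d in DIMENSIONS)
    return mismatches / len(DIMENSIONS)


def pair_cost(pred, ref, w_time=1.0, w_label=1.0):
    """Cost of pairing one predicted segment with one reference segment."""
    time_cost = 1.0 - temporal_iou(pred["span"], ref["span"])
    return w_time * time_cost + w_label * label_distance(pred["labels"], ref["labels"])


def matched_cost(preds, refs, unmatched_penalty=1.0):
    """Minimum-cost one-to-one matching plus a penalty per unmatched segment.

    Exhaustive search over assignments stands in for the Hungarian algorithm:
    both return the optimal matching, but this version is factorial-time.
    The unmatched penalty is a hypothetical regularizer, not the paper's term.
    """
    if len(preds) <= len(refs):
        # perm[i] is the reference index assigned to prediction i.
        candidates = permutations(range(len(refs)), len(preds))
        totals = (sum(pair_cost(preds[i], refs[j]) for i, j in enumerate(perm))
                  for perm in candidates)
    else:
        # perm[j] is the prediction index assigned to reference j.
        candidates = permutations(range(len(preds)), len(refs))
        totals = (sum(pair_cost(preds[i], refs[j]) for j, i in enumerate(perm))
                  for perm in candidates)
    return min(totals) + unmatched_penalty * abs(len(preds) - len(refs))


refs = [
    {"span": (0.0, 2.0), "labels": {"narrative": "hook"}},
    {"span": (2.0, 6.0), "labels": {"editing": "jump_cut"}},
]
# Same segments in a different order: the matching makes the cost order-invariant.
preds = [refs[1], refs[0]]
print(matched_cost(preds, refs))
```

Because the cost is computed after matching, a prediction that lists the right segments in a different order is not penalized, while a spurious or missing segment is. For realistic segment counts, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the exhaustive search.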
