Processing video in vision-language models (VLMs) is expensive: each frame occupies hundreds of tokens, and inference cost scales with every frame and every repeated query. We introduce Frames2LoRA, a method for parametric video internalization. A perceiver hypernetwork reads the intermediate representations produced layer-by-layer as a frozen VLM encodes a video, and generates a Low-Rank Adaptation (LoRA) adapter in a single forward pass. Unlike standard LoRA fine-tuning, which requires iterative gradient updates, Frames2LoRA predicts these weights directly from the video. Trained for SmolVLM2 500M and 2.2B on video summarization and captioning, Frames2LoRA enables the same frozen VLM to answer queries from the adapter alone, with zero visual tokens in its context at query time. Frames2LoRA is statistically non-inferior and equivalent to direct video-in-context inference across all five captioning benchmarks at both model scales, and across seven of eight video question answering benchmark-scale pairings. Although trained only on 12 frames at 384px, it remains stable up to 1,024 frames and 1024px, where direct video-in-context inference often degenerates. Across this sweep, it reduces answer-time visual-token load by up to 1,500x and query time-to-first-token (TTFT) by 6-80x, while preserving video-faithful outputs. We also find that independently generated adapters for non-overlapping video segments can compose in rank space, suggesting a path toward chunked long-video internalization.
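One way to read "compose in rank space" is that the low-rank factors of separately generated adapters are concatenated along the rank axis. The abstract does not spell out the procedure, so the sketch below is an assumption rather than the authors' method: it only checks the linear-algebra identity that stacking factors (B1, A1) and (B2, A2) along the rank dimension gives the same weight update as adding the two updates B1·A1 + B2·A2. The function names, matrix shapes, and ranks are all hypothetical.

```python
import random

def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def rand_matrix(rows, cols, rng):
    """Random matrix standing in for hypernetwork-generated factors."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def compose_in_rank_space(adapters):
    """Concatenate LoRA factors along the rank axis (assumed composition rule).

    Each adapter is a pair (B, A) with B of shape d_out x r_i and
    A of shape r_i x d_in. The result has rank sum(r_i).
    """
    d_out = len(adapters[0][0])
    # Join row i of every B side by side: d_out x (r_1 + r_2 + ...).
    b_cat = [sum((b[i] for b, _ in adapters), []) for i in range(d_out)]
    # Stack the rows of every A: (r_1 + r_2 + ...) x d_in.
    a_cat = [row for _, a in adapters for row in a]
    return b_cat, a_cat

rng = random.Random(0)
d_out, d_in = 4, 5  # hypothetical layer dimensions

# Two adapters for non-overlapping segments, with ranks 2 and 3.
seg1 = (rand_matrix(d_out, 2, rng), rand_matrix(2, d_in, rng))
seg2 = (rand_matrix(d_out, 3, rng), rand_matrix(3, d_in, rng))

b_cat, a_cat = compose_in_rank_space([seg1, seg2])
delta_cat = matmul(b_cat, a_cat)
delta_sum = add(matmul(*seg1), matmul(*seg2))

# Largest element-wise gap between the two ways of forming the update.
max_diff = max(
    abs(p - q)
    for row_p, row_q in zip(delta_cat, delta_sum)
    for p, q in zip(row_p, row_q)
)
print(f"composed rank: {len(a_cat)}, max difference: {max_diff:.2e}")
```

Under this reading, composing adapters only grows the rank of the update, so the frozen base weights and the per-segment generation step are left untouched. Whether the method applies additional scaling or normalization when composing is not stated in the abstract.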