We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that jointly captures detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by a two-stream SlowFast design of inputs for Video LLMs that aggregates features from sampled video frames effectively. Specifically, the Slow pathway extracts features at a low frame rate while keeping as much spatial detail as possible (e.g., 24x24 tokens per frame), whereas the Fast pathway operates at a high frame rate but uses a larger spatial pooling stride (e.g., downsampling by 6x) to focus on motion cues. This design thus captures both the spatial and temporal features that are essential for detailed video understanding. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks, and on some benchmarks it achieves comparable or even better performance than state-of-the-art Video LLMs that are fine-tuned on video datasets.
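The two-stream aggregation described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the frame count, the 24x24 token grid, and the sampling/pooling strides are assumptions chosen to mirror the examples in the text (low-rate sampling at full resolution for the Slow pathway, dense frames with 6x spatial pooling for the Fast pathway).

```python
import numpy as np

def slowfast_aggregate(frames, slow_stride=8, fast_pool=6):
    """Two-stream SlowFast token aggregation (illustrative sketch).

    frames: (N, H, W, C) array of per-frame visual features, e.g. N frames
    of a 24x24 token grid with C channels. All shapes/strides here are
    assumptions for illustration, not the paper's exact configuration.
    """
    n, h, w, c = frames.shape

    # Slow pathway: sample frames at a low rate, keep full spatial detail.
    slow = frames[::slow_stride]                  # (N // slow_stride, H, W, C)
    slow_tokens = slow.reshape(-1, c)             # flatten to a token list

    # Fast pathway: keep every frame, but pool spatially with a large stride
    # (average pooling over fast_pool x fast_pool windows) to keep motion cues
    # while drastically reducing tokens per frame.
    n_h, n_w = h // fast_pool, w // fast_pool
    fast = frames[:, : n_h * fast_pool, : n_w * fast_pool, :]
    fast = fast.reshape(n, n_h, fast_pool, n_w, fast_pool, c).mean(axis=(2, 4))
    fast_tokens = fast.reshape(-1, c)             # (N * n_h * n_w, C)

    # Concatenate both streams into one token sequence for the LLM,
    # keeping the total well under the model's token budget.
    return np.concatenate([slow_tokens, fast_tokens], axis=0)
```

With 48 frames of 24x24 tokens, the Slow pathway contributes 6 full-resolution frames (6 x 576 tokens) and the Fast pathway contributes all 48 frames at 4x4 tokens each (48 x 16 tokens), so the combined sequence stays far smaller than feeding every frame at full resolution.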