We address the task of zero-shot video classification for extremely fine-grained actions (e.g., Windmill Dunk in basketball), where no video examples or temporal annotations are available for unseen classes. While image-language models (e.g., CLIP, SigLIP) show strong open-set recognition, they lack the temporal modeling needed for video understanding. We propose ActAlign, a truly zero-shot, training-free method that formulates video classification as a sequence alignment problem while preserving the generalization strength of pretrained image-language models. For each class, a large language model (LLM) generates an ordered sequence of sub-actions, which we align with video frames using Dynamic Time Warping (DTW) in a shared embedding space. Without any video-text supervision or fine-tuning, ActAlign achieves 30.5% accuracy on ActionAtlas, the most diverse benchmark of fine-grained actions across multiple sports, where human performance is only 61.6%. ActAlign outperforms billion-parameter video-language models while using 8x fewer parameters. Our approach is model-agnostic and domain-general, demonstrating that structured language priors combined with classical alignment methods can unlock the open-set recognition potential of image-language models for fine-grained video understanding.
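To make the alignment step concrete, the sketch below shows one way to score a candidate class by running DTW over a cosine-similarity matrix between frame embeddings and the LLM-generated sub-action embeddings, then picking the best-aligned class. This is a minimal illustration under assumptions, not the paper's exact implementation: the function names `dtw_alignment_score` and `classify`, the cost definition (1 minus cosine similarity), and the path-length normalization are all illustrative choices; embeddings are assumed to be precomputed and L2-normalized from a shared image-language space such as SigLIP.

```python
# Minimal sketch of DTW-based class scoring (illustrative, not the paper's code).
import numpy as np

def dtw_alignment_score(frame_emb: np.ndarray, subaction_emb: np.ndarray) -> float:
    """Score how well an ordered sub-action script aligns with a video.

    frame_emb:     (T, d) L2-normalized frame embeddings.
    subaction_emb: (K, d) L2-normalized text embeddings of the sub-action sequence.
    Returns a similarity-like score; higher means a better monotonic alignment.
    """
    sim = frame_emb @ subaction_emb.T      # (T, K) cosine similarities
    cost = 1.0 - sim                       # lower cost = better frame/sub-action match
    T, K = cost.shape

    # Standard DTW dynamic program with diagonal / vertical / horizontal moves.
    D = np.full((T + 1, K + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, K + 1):
            D[i, j] = cost[i - 1, j - 1] + min(
                D[i - 1, j - 1],   # advance both frame and sub-action
                D[i - 1, j],       # stay on the same sub-action for another frame
                D[i, j - 1],       # move to the next sub-action
            )
    # Normalize by a path-length proxy so scripts of different length K are comparable
    # (an assumption made for this sketch).
    return 1.0 - D[T, K] / (T + K)

def classify(frame_emb: np.ndarray, class_scripts: dict[str, np.ndarray]) -> str:
    """Return the class whose sub-action script aligns best with the video."""
    return max(class_scripts, key=lambda c: dtw_alignment_score(frame_emb, class_scripts[c]))
```

In this sketch, repeating a sub-action across consecutive frames (the vertical move) lets a short script cover a long clip, which is what makes monotonic alignment a natural fit for ordered sub-action descriptions.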