The ultimate goal for foundation models is to be task-agnostic, i.e., to support out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, it remains challenging for video models to reach this goal because of the greater uncertainty of spatiotemporal signals. To ease training, existing works leverage the prior knowledge of image foundation models and equip them with efficient temporal modules. Despite satisfactory fine-tuning performance, we empirically find that these models fall short of out-of-the-box usage: under zero-shot and linear protocols, they even perform worse than their baseline counterparts. In this work, we analyze the cause of this degradation from the perspective of language supervision distortion. We argue that tuning the text encoder end-to-end, as done in previous work, is suboptimal because the encoder may overfit to the style of the training text, thereby losing its original ability to capture semantics across language registers. The overfitted text encoder, in turn, provides a harmful supervision signal that degrades the video representation. To tackle this issue, we propose a degradation-free pre-training strategy that retains the generalization ability of the text encoder by freezing its shallow layers, while its tunable deep layers capture task-related semantics. As the training objective, we adopt the transcript sorting task from TVTS, combined with masking techniques to enable scalable training. As a result, we produce a series of models, dubbed TVTSv2, with up to one billion parameters. With a frozen backbone, we achieve new state-of-the-art results on various video benchmarks, surpassing recent models such as ImageBind and InternVideo. Code is available at https://github.com/TencentARC/TVTS.
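The partial-freezing idea described in the abstract (shallow text-encoder layers frozen, deep layers left tunable) can be illustrated with a minimal sketch. This is not the TVTSv2 implementation: the `Param` and `Layer` classes are hypothetical stand-ins for framework objects that carry a `requires_grad` flag, the helper names are invented for illustration, and the number of frozen layers is an assumed hyperparameter that the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Param:
    """Hypothetical stand-in for a framework parameter with a requires_grad flag."""
    name: str
    requires_grad: bool = True


@dataclass
class Layer:
    """Hypothetical stand-in for one transformer block of a text encoder."""
    params: List[Param] = field(default_factory=list)


def build_toy_encoder(num_layers: int) -> List[Layer]:
    """Build a toy encoder, ordered from shallow (index 0) to deep."""
    return [
        Layer([Param(f"layer{i}.weight"), Param(f"layer{i}.bias")])
        for i in range(num_layers)
    ]


def freeze_shallow_layers(layers: List[Layer], num_frozen: int) -> None:
    """Freeze the first `num_frozen` layers; keep the remaining deep layers tunable.

    `num_frozen` is an assumed hyperparameter, not a value taken from the paper.
    """
    if not 0 <= num_frozen <= len(layers):
        raise ValueError("num_frozen must be between 0 and the number of layers")
    for depth, layer in enumerate(layers):
        trainable = depth >= num_frozen
        for param in layer.params:
            param.requires_grad = trainable


def trainable_param_names(layers: List[Layer]) -> List[str]:
    """List the parameters that an optimizer would still update."""
    return [p.name for layer in layers for p in layer.params if p.requires_grad]


if __name__ == "__main__":
    encoder = build_toy_encoder(num_layers=4)
    freeze_shallow_layers(encoder, num_frozen=2)
    print(trainable_param_names(encoder))
```

In a real training setup the same loop would typically run over the encoder's actual blocks, and only the parameters that remain trainable would be passed to the optimizer.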