TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

The ultimate goal for foundation models is to be task-agnostic, i.e., to support out-of-the-box usage without task-specific fine-tuning. Although breakthroughs have been made in natural language processing and image representation learning, video models still struggle to reach this goal because of the greater uncertainty of spatiotemporal signals. To ease training, existing works leverage the prior knowledge of image foundation models and equip them with efficient temporal modules. Despite satisfactory fine-tuning performance, we empirically find that these models fall short of out-of-the-box usage: under zero-shot and linear protocols, they even perform worse than their baseline counterparts. In this work, we analyze the cause of this degradation from the perspective of language supervision distortion. We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal because it may overfit to particular language styles, thereby losing its original ability to generalize across the semantics of various language registers. The overfitted text encoder, in turn, provides a harmful supervision signal that degrades the video representation. To tackle this issue, we propose a degradation-free pre-training strategy that retains the generalization ability of the text encoder by freezing its shallow layers while letting the tunable deep layers capture task-related semantics. As the training objective, we adopt the transcript sorting task from TVTS, combined with masking techniques to enable scalable training. As a result, we produce a series of models, dubbed TVTSv2, with up to one billion parameters. We achieve new state-of-the-art results on various video benchmarks with a frozen backbone, surpassing the recent ImageBind, InternVideo, etc. Code is available at https://github.com/TencentARC/TVTS.
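The partial-freezing idea in the abstract (shallow text-encoder layers frozen, deep layers tunable) can be illustrated with a small framework-free sketch. Everything here is an assumption made for illustration: the `Layer` stand-in, the `freeze_shallow_layers` helper, the 12-layer depth, and the 6/6 split are hypothetical and are not taken from the TVTSv2 code. The abstract does not state how many layers are frozen.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Hypothetical stand-in for one transformer block of a text encoder."""
    index: int              # 0 is the shallowest block
    trainable: bool = True  # True means the block's weights receive updates


def freeze_shallow_layers(layers: List[Layer], num_frozen: int) -> List[Layer]:
    """Freeze the first `num_frozen` blocks and keep the deeper ones tunable."""
    if not 0 <= num_frozen <= len(layers):
        raise ValueError("num_frozen must be between 0 and the number of layers")
    for layer in layers:
        # Shallow blocks keep their pre-trained weights; deep blocks stay tunable.
        layer.trainable = layer.index >= num_frozen
    return layers


# Assumed configuration: a 12-block encoder with its 6 shallowest blocks frozen.
encoder = [Layer(index=i) for i in range(12)]
freeze_shallow_layers(encoder, num_frozen=6)

frozen = [layer.index for layer in encoder if not layer.trainable]
trainable = [layer.index for layer in encoder if layer.trainable]
print("frozen:", frozen)
print("trainable:", trainable)
```

In a real training framework the `trainable` flag would correspond to disabling gradient updates for the frozen blocks' parameters, so that only the deep blocks adapt to the pre-training objective.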
