Transferring visual-language knowledge from large-scale foundation models to video recognition has proven effective. To bridge the domain gap, additional parametric modules are typically added to capture temporal information. However, zero-shot generalization diminishes as the number of specialized parameters grows, forcing existing methods to trade off between zero-shot and close-set performance. In this paper, we present MoTE, a novel framework that balances generalization and specialization in one unified model. Our approach tunes a mixture of temporal experts to learn multiple task views with varying degrees of data fitting. To maximally preserve the knowledge of each expert, we propose \emph{Weight Merging Regularization}, which regularizes the merging process of the experts in weight space. We further introduce temporal feature modulation to regularize the contribution of the temporal feature at test time. MoTE achieves a sound balance between zero-shot and close-set video recognition and obtains state-of-the-art or competitive results on various datasets, including Kinetics-400 \& 600, UCF, and HMDB. Code is available at \url{https://github.com/ZMHH-H/MoTE}.
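To make the weight-space idea concrete, below is a minimal sketch of one plausible instantiation, not the paper's exact formulation: several temporal experts sharing one architecture are merged by uniformly averaging their parameters, and a penalty keeps each expert close to the merged weights so that merging preserves their knowledge. The class `TemporalExpert`, the uniform average, the squared-distance penalty, and the regularization weight are illustrative assumptions.

```python
# Hedged sketch of weight-space merging with a merging regularizer.
# Assumptions (not confirmed by the abstract): K experts share one
# architecture, merging is a uniform parameter average, and the
# regularizer penalizes each expert's drift from the merged weights.
import torch
import torch.nn as nn


class TemporalExpert(nn.Module):
    """Toy temporal module standing in for one expert in the mixture."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


def merge_experts(experts: list[TemporalExpert]) -> dict[str, torch.Tensor]:
    """Average the experts' parameters in weight space (uniform merge)."""
    merged = {}
    for name, _ in experts[0].named_parameters():
        merged[name] = torch.stack(
            [dict(e.named_parameters())[name] for e in experts]
        ).mean(dim=0)
    return merged


def weight_merging_regularizer(experts: list[TemporalExpert]) -> torch.Tensor:
    """Hypothetical regularizer: squared distance of each expert from the
    merged weights, keeping the mixture mergeable without large knowledge loss."""
    merged = merge_experts(experts)
    loss = torch.zeros(())
    for e in experts:
        for name, p in e.named_parameters():
            # Detach the merged target so each expert is pulled toward the
            # current average rather than the average chasing the experts.
            loss = loss + (p - merged[name].detach()).pow(2).sum()
    return loss / len(experts)


if __name__ == "__main__":
    experts = [TemporalExpert(dim=8) for _ in range(4)]
    x = torch.randn(2, 8)
    task_loss = sum(e(x).pow(2).mean() for e in experts)  # placeholder task loss
    reg = weight_merging_regularizer(experts)
    total = task_loss + 0.1 * reg  # 0.1 is an arbitrary illustrative weight
    total.backward()
    print(f"task={task_loss.item():.4f}  merge_reg={reg.item():.4f}")
```

At test time, a single merged module built via `merge_experts` could replace the mixture; the regularizer above is one simple way to keep that merge lossless-ish, and the actual MoTE regularization may differ.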