SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, which limits their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core is a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP has four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. In the Exos setting, a single SkillMoV model trained jointly across all scenarios reaches 50.17% overall accuracy, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points (pp). In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration support the contribution of each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
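The four MoVP stages listed above can be illustrated with a minimal NumPy sketch of a forward pass. This is not the authors' implementation: the class name `MoVPSketch`, the feature width, the hidden size, the number of views, the four proficiency classes, the mean pooling over views, the specific gating formula, and the random weights are all assumptions made for illustration. Only the count of twelve experts and the order of the stages come from the abstract.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class MoVPSketch:
    """Illustrative four-stage projector; all sizes and weights are hypothetical."""

    def __init__(self, dim=32, hidden=64, num_experts=12, num_classes=4, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1  # small init scale (assumption)
        self.dim = dim
        self.router = rng.normal(0, s, (dim, num_experts))    # routing logits
        self.w1 = rng.normal(0, s, (num_experts, dim, hidden))  # expert MLP layer 1
        self.w2 = rng.normal(0, s, (num_experts, hidden, dim))  # expert MLP layer 2
        self.wq, self.wk, self.wv = (rng.normal(0, s, (dim, dim)) for _ in range(3))
        self.prototypes = rng.normal(0, s, (num_classes, dim))  # class-level anchors
        self.w_gate = rng.normal(0, s, (2 * dim, dim))
        self.w_out = rng.normal(0, s, (dim, dim))

    def forward(self, views):
        """views: array of shape (V, dim), one feature vector per camera."""
        # (i) Soft routing: every view gets a distribution over all experts,
        # computed from the view feature alone (no camera-identity input).
        route = softmax(views @ self.router)                      # (V, E)
        hidden = np.maximum(np.einsum("vd,edh->veh", views, self.w1), 0.0)
        expert_out = np.einsum("veh,ehd->ved", hidden, self.w2)   # (V, E, dim)
        routed = np.einsum("ve,ved->vd", route, expert_out)       # (V, dim)

        # (ii) Cross-view attention: each view attends to all synchronized views.
        q, k, v = routed @ self.wq, routed @ self.wk, routed @ self.wv
        attn = softmax(q @ k.T / np.sqrt(self.dim))               # (V, V)
        aligned = routed + attn @ v                               # residual update
        pooled = aligned.mean(axis=0)                             # (dim,)

        # (iii) Prototype anchoring: soft-assign the pooled feature to prototypes.
        proto_weights = softmax(self.prototypes @ pooled / np.sqrt(self.dim))
        anchor = proto_weights @ self.prototypes                  # (dim,)

        # (iv) Prototype-conditioned gate mixes the projection and the anchor.
        gate_in = np.concatenate([pooled, anchor]) @ self.w_gate
        gate = 1.0 / (1.0 + np.exp(-gate_in))                     # sigmoid, (dim,)
        embedding = gate * (pooled @ self.w_out) + (1.0 - gate) * anchor
        return embedding, route, proto_weights


# Usage: four synchronized camera features (the view count is an assumption).
model = MoVPSketch()
views = np.random.default_rng(1).normal(size=(4, 32))
embedding, route, proto_weights = model.forward(views)
print(embedding.shape, route.shape, proto_weights.shape)
```

In this sketch the router is dense: each view receives a weighted sum of all twelve expert outputs rather than a hard top-k selection, which is one plausible reading of "soft router". The returned `route` matrix can be inspected per view to see whether different cameras come to prefer different experts.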
