Existing approaches treat action quality assessment and skill proficiency estimation as classification problems, outputting discrete labels without interpretable reasoning. We reformulate this task as generative vision-language modeling and introduce ProfVLM, a compact model that jointly predicts proficiency levels and generates expert-like natural language feedback from multi-view videos. ProfVLM leverages conditional language generation to provide actionable insights alongside quantitative evaluation scores. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view egocentric and exocentric features from a frozen TimeSformer backbone and projects them into a language model fine-tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art classification-based methods while using up to 20x fewer parameters and reducing training time by up to 60%. By producing natural language critiques aligned with performance levels, this work shows that generative vision-language modeling offers a powerful and efficient paradigm shift for interpretable action quality assessment.