Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL, and InternVL3.5 models perform only marginally above random chance. Strategies such as incorporating skeleton information, grounding instructions, reasoning structures, and in-context learning yield isolated gains, but none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases and reflect a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and outline the failure modes that must be mitigated before reliable real-world deployment.