We built a multi-task pipeline for tennis stroke biomechanics from plain RGB video. On top of pose-based stroke recognition, it adds two new tasks, predicting shot direction and grading posture quality, plus a rule-based feedback layer that suggests coaching tips. Strokes are detected automatically with a weighted joint velocity score, s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder, which removes the need for manual segmentation of the video. Pose comes from MediaPipe Pose Landmarker (33 landmarks, metric world coordinates), and each stroke is turned into a 30-frame by 39-feature sequence for TennisTransformerGPU, a compact 564,103-parameter transformer (4 layers, 4 heads, d=128) with three parallel output heads. Trained on 1,281 labeled strokes from 7 pros and 1 amateur across 11 videos, it reaches 83.7% stroke-type accuracy, 61.9% on direction, and 62.6% on posture under a random 80/20 split. The interesting test is cross-player: train on the pros, evaluate on the amateur. Stroke type barely budges, at 82.9%, a drop of 0.8 percentage points. Direction prediction does not transfer; the model falls back to predicting the majority class. An ablation shows why world coordinates matter so much here: switching to image-space landmarks drops cross-player stroke-type accuracy from 83% to 47% and direction accuracy from 68% to 21%. Everything runs on Kaggle's free T4 GPU tier and is fully reproducible.
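The stroke detection score can be illustrated with a minimal sketch. The weights 0.5, 0.3, and 0.2 come from the formula above; everything else is an assumption made for illustration. In particular, the abstract does not define v_wrist, m_elbow, or m_shoulder, so the sketch treats all three as per-frame speeds computed by finite differences on world-coordinate landmark positions. The peak picker, its `threshold`, and its `min_gap` parameter are hypothetical and are not taken from the pipeline.

```python
import math

# Weights from the score s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder.
WEIGHTS = {"wrist": 0.5, "elbow": 0.3, "shoulder": 0.2}


def speeds(track, fps):
    """Per-frame speed of one joint from a list of (x, y, z) positions.

    Assumption: speed is a finite difference between consecutive frames.
    The first frame has no predecessor, so its speed is set to 0.
    """
    out = [0.0]
    for prev, cur in zip(track, track[1:]):
        out.append(math.dist(prev, cur) * fps)
    return out


def stroke_score(joints, fps=30.0):
    """Weighted joint-velocity score s(t) for every frame.

    `joints` maps "wrist", "elbow", and "shoulder" to equal-length
    position tracks (hypothetical input layout).
    """
    per_joint = {name: speeds(joints[name], fps) for name in WEIGHTS}
    n_frames = len(per_joint["wrist"])
    return [
        sum(WEIGHTS[name] * per_joint[name][t] for name in WEIGHTS)
        for t in range(n_frames)
    ]


def find_strokes(score, threshold, min_gap=30):
    """Return frame indices of local maxima of the score above `threshold`.

    Hypothetical peak picker: a peak is kept only if it lies at least
    `min_gap` frames after the previously kept peak.
    """
    peaks = []
    for t in range(1, len(score) - 1):
        is_peak = (
            score[t] >= threshold
            and score[t] > score[t - 1]
            and score[t] >= score[t + 1]
        )
        if is_peak and (not peaks or t - peaks[-1] >= min_gap):
            peaks.append(t)
    return peaks
```

Under these assumptions, each returned frame index could serve as the anchor for cutting a 30-frame window around a stroke. How the actual pipeline places that window relative to the peak is not stated in the abstract.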