Accurate human posture classification in images and videos is crucial for automated applications across various fields, including work safety, physical rehabilitation, sports training, and daily assisted living. Recently, multimodal learning methods such as Contrastive Language-Image Pretraining (CLIP) have advanced significantly in the joint understanding of images and text. This study assesses the effectiveness of CLIP in classifying human postures, focusing on its application to yoga. Although the zero-shot approach initially showed limitations, transfer learning on 15,301 images (real and synthetic) spanning 82 classes yielded promising results. The article describes the full fine-tuning procedure, including the choice of image-description syntax and the adjustment of models and hyperparameters. Tested on 3,826 images, the fine-tuned CLIP model achieves an accuracy above 85%, surpassing the previous state of the art on the same dataset by approximately 6%, while requiring 3.5 times less training time than fine-tuning a YOLOv8-based model. In more application-oriented scenarios, with smaller six-posture datasets containing 1,301 and 401 training images, the fine-tuned models reach accuracies of 98.8% and 99.1%, respectively. Furthermore, our experiments indicate that training with as few as 20 images per pose can yield around 90% accuracy on a six-class dataset. This study demonstrates that this multimodal technique can be used effectively for yoga pose classification, and possibly for human posture classification in general. Additionally, CLIP's inference time of around 7 ms makes the model suitable for integration into automated posture-evaluation systems, e.g., a real-time personal yoga assistant for performance assessment.
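To make the zero-shot baseline concrete, the sketch below shows how CLIP can score a yoga image against one caption per pose class; the highest-scoring caption is taken as the prediction. It is a minimal illustration assuming the Hugging Face transformers implementation of openai/clip-vit-base-patch32; the pose names, caption template, and image path are illustrative placeholders, not the exact syntax used in the paper.

```python
# Minimal zero-shot CLIP pose classification sketch (assumptions: Hugging Face
# transformers, openai/clip-vit-base-patch32; pose list and file name are
# hypothetical examples, not the paper's actual classes or caption syntax).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# One candidate caption per pose class, built from a simple template.
poses = ["tree", "warrior ii", "downward dog", "triangle", "cobra", "chair"]
captions = [f"a photo of a person doing the {p} yoga pose" for p in poses]

image = Image.open("example_pose.jpg")  # placeholder input image
inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)

outputs = model(**inputs)
# logits_per_image holds image-text similarity scores; softmax turns them
# into a probability distribution over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=1)
print(poses[probs.argmax().item()], float(probs.max()))
```

Fine-tuning, as studied in the paper, keeps this same image-caption matching objective but updates the model weights on labeled pose images instead of relying on the pretrained weights alone.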