Vision-based human activity recognition (HAR) has made substantial progress in recognizing predefined gestures, but it adapts poorly to emerging activities. This paper proposes a shift in approach: using generative modeling and large language models (LLMs) to enhance vision-based HAR. Specifically, we propose using LLMs to generate descriptive textual representations of activities, with pose keypoints as an intermediate representation. Pose keypoints add contextual depth to the recognition process and can be arranged as sequences of vectors that resemble text chunks, which makes them compatible with LLMs. This fusion of computer vision and natural language processing holds significant potential for activity recognition. A proof-of-concept study on a subset of the Kinetics700 dataset validates the approach, highlighting improved accuracy and interpretability. Future implications include enhanced accuracy, novel research avenues, better model generalization, and ethical considerations around transparency. The framework has real-world applications, including personalized gym workout feedback and nuanced sports training insights. By connecting visual cues to interpretable textual descriptions, it advances the accuracy and applicability of HAR and contributes to pervasive computing and activity recognition research. As the approach evolves, it promises a more insightful understanding of human activities across diverse contexts, marking a significant step forward.
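The abstract does not specify how pose keypoints are turned into LLM-compatible sequences, so the following is only a minimal sketch of one possible encoding, not the authors' method. It assumes keypoints arrive as (x, y) pairs normalized to [0, 1] and keyed by joint name; the names `quantize`, `frame_to_text`, and `sequence_to_prompt`, the 100-bin quantization, and the prompt wording are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple

# Assumption: each keypoint is an (x, y) pair normalized to [0, 1].
Keypoint = Tuple[float, float]
# Assumption: a frame maps a joint name (e.g. "nose") to its keypoint.
Frame = Dict[str, Keypoint]


def quantize(value: float, bins: int = 100) -> int:
    """Map a normalized coordinate to an integer bin in [0, bins - 1]."""
    clamped = min(max(value, 0.0), 1.0)  # guard against out-of-range input
    return min(int(clamped * bins), bins - 1)


def frame_to_text(frame: Frame, bins: int = 100) -> str:
    """Serialize one frame as a text chunk, with joints in sorted order."""
    parts = []
    for name in sorted(frame):
        x, y = frame[name]
        parts.append(f"{name}=({quantize(x, bins)},{quantize(y, bins)})")
    return " ".join(parts)


def sequence_to_prompt(frames: List[Frame], bins: int = 100) -> str:
    """Build a plain-text prompt: one header line, then one line per frame."""
    header = "Describe the human activity shown by these pose keypoints."
    lines = [f"frame {i}: {frame_to_text(f, bins)}" for i, f in enumerate(frames)]
    return "\n".join([header, *lines])


if __name__ == "__main__":
    # Hypothetical two-frame clip with two joints per frame.
    clip = [
        {"nose": (0.5, 0.25), "left_wrist": (0.25, 0.75)},
        {"nose": (0.5, 0.25), "left_wrist": (0.25, 0.5)},
    ]
    print(sequence_to_prompt(clip))
```

Quantizing coordinates into a small integer vocabulary keeps each frame short and regular, which is one way to make a keypoint sequence look like the text chunks an LLM already consumes. A real system might instead feed learned embeddings of the keypoint vectors directly to the model.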