As generative AI continues to evolve, Vision Language Models (VLMs) have emerged as promising tools for a range of healthcare applications. One area that remains relatively underexplored is their use in human activity recognition (HAR) for remote health monitoring. VLMs offer notable strengths, including greater flexibility and the ability to overcome some of the constraints of traditional deep learning models. However, a key challenge in applying VLMs to HAR lies in evaluating their dynamic and often non-deterministic outputs. To address this gap, we introduce a descriptive caption dataset and propose comprehensive methods for evaluating VLMs on HAR. Through comparative experiments with state-of-the-art deep learning models, our findings show that VLMs achieve comparable performance and, in some cases, even surpass conventional approaches in accuracy. This work establishes a strong benchmark and opens new possibilities for integrating VLMs into intelligent healthcare systems.
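To make the evaluation challenge concrete: one common way to score non-deterministic, free-form captions against a fixed activity label set is embedding similarity. The minimal sketch below illustrates this idea using the sentence-transformers library; the model name, label set, and example captions are hypothetical, and this is not the evaluation protocol proposed in the paper.

```python
# Hypothetical sketch: mapping a free-form VLM caption to the closest
# HAR activity label via sentence-embedding cosine similarity.
# Assumes the `sentence-transformers` package; the model and the label
# set below are illustrative, not the paper's actual protocol.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Candidate activity classes (hypothetical HAR label set).
labels = ["walking", "sitting down", "falling", "standing up", "lying down"]

def classify_caption(caption: str) -> tuple[str, float]:
    """Return the activity label closest to the caption, with its score."""
    cap_emb = model.encode(caption, convert_to_tensor=True)
    lbl_emb = model.encode(labels, convert_to_tensor=True)
    scores = util.cos_sim(cap_emb, lbl_emb)[0]  # cosine similarity per label
    best = int(scores.argmax())
    return labels[best], float(scores[best])

# Two differently worded captions for the same activity should map to
# the same label despite non-deterministic generation.
print(classify_caption("An elderly person slowly lowers themselves onto a chair."))
print(classify_caption("The subject takes a seat on the sofa."))
```

Because generation varies from run to run, scoring in embedding space rather than by exact string match lets differently phrased captions of the same activity receive consistent credit.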