Integrating inertial measurement units (IMUs) with large language models (LLMs) expands the potential of multimodal AI, enabling more nuanced human activity analysis. In this paper, we introduce LLaSA (Large Language and Sensor Assistant), a multimodal large language model built on LIMU-BERT and Llama that leverages sensor data and contextual reasoning to interpret and answer queries about human activity and motion analysis. To develop LLaSA, we introduce two key datasets: SensorCaps, a comprehensive collection of 35,960 IMU-derived narratives with handcrafted features, and OpenSQA, an instruction-following dataset of 179,727 question-answer pairs grounded in sensor and human-activity context. These datasets provide diverse, rich inputs for training LLaSA on complex sensor-based queries. To optimize LLaSA's performance, we apply a tailored hyperparameter-tuning method that significantly enhances its effectiveness on contextual question-answering tasks. Extensive evaluations, including a human-led assessment of question-answering quality, demonstrate that LLaSA delivers better data interpretation and more context-aware responses than GPT-3.5-Turbo and Vicuna-1.5-13b-16K. These contributions advance the frontier of sensor-aware LLMs and open new opportunities for impactful multimodal research in healthcare, sports science, and human-computer interaction. Our code repository and datasets are available at https://github.com/BASHLab/LLaSA.
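The abstract names the architecture only at a high level. As a point of reference, the following is a minimal sketch, ours rather than the authors' released implementation, of the common encoder-projector-LLM pattern that a sensor-language model built on a pretrained IMU encoder (such as LIMU-BERT) and a Llama-style decoder typically follows. The class name `SensorLLMSketch`, all dimensions, and the `inputs_embeds` calling convention are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch (not the authors' released code) of a LLaVA-style
# sensor-language model: a pretrained IMU encoder (a stand-in for
# LIMU-BERT) produces sensor embeddings, a learned projector maps them
# into the LLM's token-embedding space, and the projected "sensor
# tokens" are prepended to the embedded text prompt. All module names
# and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class SensorLLMSketch(nn.Module):
    def __init__(self, imu_encoder: nn.Module, llm: nn.Module,
                 imu_dim: int = 72, llm_dim: int = 4096):
        super().__init__()
        self.imu_encoder = imu_encoder  # e.g., a frozen LIMU-BERT-style encoder
        self.projector = nn.Linear(imu_dim, llm_dim)  # trainable modality bridge
        self.llm = llm  # e.g., a Llama-style decoder

    def forward(self, imu_window: torch.Tensor, text_embeds: torch.Tensor):
        # imu_window: (batch, timesteps, channels) accelerometer/gyroscope data
        with torch.no_grad():  # keep the pretrained sensor encoder frozen
            sensor_feats = self.imu_encoder(imu_window)  # (batch, seq, imu_dim)
        sensor_tokens = self.projector(sensor_feats)     # (batch, seq, llm_dim)
        # Prepend sensor tokens to the embedded text prompt and decode as usual
        # (assumes the decoder accepts precomputed embeddings, as Hugging Face
        # models do via the inputs_embeds keyword).
        inputs = torch.cat([sensor_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```

Under these assumptions, only the projector (and optionally the LLM, via fine-tuning) is trained on the instruction data, which is the usual way such modality bridges are optimized.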