HapticLLaMA: A Multimodal Sensory Language Model for Haptic Captioning

Haptic captioning is the task of generating natural language descriptions from haptic signals, such as vibrations, for use in virtual reality, accessibility, and rehabilitation applications. While previous multimodal research has focused primarily on vision and audio, haptic signals, which convey the sense of touch, remain underexplored. To address this gap, we formalize the haptic captioning task and propose HapticLLaMA, a multimodal sensory language model that translates vibration signals into descriptions in a given sensory, emotional, or associative category. We investigate two types of haptic tokenizers, a frequency-based tokenizer and an EnCodec-based tokenizer, that convert haptic signals into sequences of discrete units, enabling their integration with the LLaMA model. HapticLLaMA is trained in two stages: (1) supervised fine-tuning of the LLaMA architecture with LoRA-based adaptation, and (2) fine-tuning via reinforcement learning from human feedback (RLHF). We assess HapticLLaMA's captioning performance using both automated n-gram metrics and human evaluation. HapticLLaMA demonstrates strong capability in interpreting haptic vibration signals, achieving a METEOR score of 59.98 and a BLEU-4 score of 32.06. Additionally, over 61% of the generated captions received human ratings above 3.5 on a 7-point scale, with RLHF yielding a 10% improvement in the overall rating distribution, indicating stronger alignment with human haptic perception. These findings highlight the potential of large language models to process and adapt to sensory data.
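
To make the idea of converting a vibration signal into a sequence of discrete units concrete, the following is a minimal sketch of one possible frequency-based tokenizer. It is not the tokenizer used in HapticLLaMA: the function names, the frame size, the number of frequency bins, and the `<freq_k>` token format are all assumptions chosen for illustration. Each fixed-length frame is reduced to its dominant frequency with a naive discrete Fourier transform, and that frequency is quantized into one of a small number of bins.

```python
import math
from typing import List, Optional


def dominant_frequency(frame: List[float], sample_rate: float) -> float:
    """Return the frequency (Hz) of the strongest non-DC DFT bin in a frame."""
    n = len(frame)
    best_bin, best_power = 0, -1.0
    # Naive DFT over the positive-frequency bins; fine for short frames.
    for k in range(1, n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        power = re * re + im * im
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate / n


def frequency_tokenize(
    signal: List[float],
    sample_rate: float,
    frame_size: int = 100,          # assumed frame length in samples
    num_bins: int = 10,             # assumed size of the frequency vocabulary
    max_freq: Optional[float] = None,
) -> List[str]:
    """Map each non-overlapping frame to a hypothetical token such as '<freq_3>'."""
    if max_freq is None:
        max_freq = sample_rate / 2  # Nyquist frequency
    tokens = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        freq = dominant_frequency(frame, sample_rate)
        # Quantize the dominant frequency into one of num_bins equal-width bins.
        bin_index = min(int(freq / max_freq * num_bins), num_bins - 1)
        tokens.append(f"<freq_{bin_index}>")
    return tokens
```

Tokens produced this way could, in principle, be added to a language model's vocabulary and interleaved with text tokens. A learned codec such as EnCodec would instead yield codebook indices rather than hand-defined frequency bins.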
