IMU-based gesture interfaces are increasingly being adopted as efficient, accessible, and intuitive alternatives to traditional input methods such as touchscreens and voice. However, current gesture recognition algorithms are tailored to specific devices (e.g., smartwatches vs. earbuds) or user populations (e.g., blind vs. sighted users), limiting their generalizability. In this paper, we present UniMotion, a generalized IMU-based gesture recognition framework that works across devices and populations with minimal training samples. To overcome the challenges and high cost of collecting large-scale labeled training data, UniMotion leverages readily available unlabeled human activity data. The UniMotion pipeline comprises two stages: (1) pre-training a motion representation model on abundant unlabeled human activity data, and (2) fine-tuning it with a small amount of labeled gesture data. For pre-training, we introduce a token-based strategy and embeddings that learn to identify and focus attention on the key motion signatures in the temporal data. For fine-tuning, we design a text-guided classifier that reliably differentiates between temporally or semantically similar gestures. We evaluate UniMotion on both hand gestures (captured through a smartwatch) and earbud gestures (captured through earbuds), using data collected from blind and sighted users. Across these diverse devices and user populations, UniMotion achieves 85\% accuracy over an average of 13 gesture classes while using only 10\% of the labeled data for training, significantly outperforming state-of-the-art self-supervised learning approaches and specialized gesture recognition models.