WEAR: A Multimodal Dataset for Wearable and Egocentric Video Activity Recognition

Though research has shown the complementarity of camera- and inertial-based data, datasets that offer both modalities remain scarce. In this paper we introduce WEAR, a multimodal benchmark dataset for both vision- and wearable-based Human Activity Recognition (HAR). The dataset comprises data from 18 participants performing a total of 18 different workout activities, with untrimmed inertial (acceleration) and camera (egocentric video) data recorded at 10 different outdoor locations. WEAR features a diverse set of activities that are low in inter-class similarity and, unlike those in previous egocentric datasets, are neither defined by human-object interactions nor drawn from inherently distinct activity categories. The provided benchmark results reveal that single-modality architectures have different strengths and weaknesses in their prediction performance. Further, in light of the recent success of transformer-based video action detection models, we demonstrate their versatility by applying them in a plain fashion using vision, inertial and combined (vision + inertial) features as input. Results show that vision transformers not only produce competitive results using only inertial data, but can also serve as an architecture to fuse both modalities by means of simple concatenation, with the multimodal approach producing the highest average mAP and precision and close-to-best F1-scores. Up until now, vision-based transformers have been explored in neither inertial nor multimodal human activity recognition, making our approach the first to do so. The dataset and the code to reproduce the experiments are publicly available via: mariusbock.github.io/wear
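The fusion "by means of simple concatenation" mentioned above can be illustrated with a minimal sketch. It assumes each modality has already been turned into one feature vector per clip and that both modalities cover the same clips. The function name `fuse_by_concatenation`, the feature dimensions (2048 for vision, 150 for inertial) and the random inputs are hypothetical choices made for illustration; they are not taken from the WEAR code base.

```python
import numpy as np


def fuse_by_concatenation(vision_feats, inertial_feats):
    """Join per-clip vision and inertial features along the feature axis.

    Both inputs are expected to have shape (num_clips, feature_dim) and to
    describe the same clips in the same order (an assumption of this sketch).
    """
    vision_feats = np.asarray(vision_feats, dtype=np.float32)
    inertial_feats = np.asarray(inertial_feats, dtype=np.float32)
    if vision_feats.ndim != 2 or inertial_feats.ndim != 2:
        raise ValueError("expected 2-D arrays of shape (num_clips, feature_dim)")
    if vision_feats.shape[0] != inertial_feats.shape[0]:
        raise ValueError("both modalities must cover the same number of clips")
    # Simple concatenation: the fused vector is [vision | inertial] per clip.
    return np.concatenate([vision_feats, inertial_feats], axis=1)


# Hypothetical feature sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_clips = 8
vision = rng.normal(size=(num_clips, 2048))
inertial = rng.normal(size=(num_clips, 150))

fused = fuse_by_concatenation(vision, inertial)
print(fused.shape)  # (8, 2198)
```

The fused array can then be passed to a downstream model in place of a single-modality feature array; the only change the model sees is a wider feature dimension.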
