With the rapid development of wearable cameras, massive collections of egocentric video for first-person visual perception have become available. Predicting first-person activity from egocentric video faces many challenges, including a limited field of view, occlusions, and unstable motion. Because sensor data from wearable devices facilitates human activity recognition, multi-modal activity recognition is attracting increasing attention. However, the lack of related datasets hinders the development of multi-modal deep learning for egocentric activity recognition. Meanwhile, the deployment of deep learning in the real world has drawn attention to continual learning, which often suffers from catastrophic forgetting. Yet catastrophic forgetting in egocentric activity recognition, especially with multiple modalities, remains unexplored because no suitable dataset is available. To support this research, we present UESTC-MMEA-CL, a multi-modal egocentric activity dataset for continual learning, collected with self-developed glasses that integrate a first-person camera and wearable sensors. It contains synchronized video, accelerometer, and gyroscope data for 32 types of daily activities performed by 10 participants. We compare its class types and scale with those of other publicly available datasets, and we provide a statistical analysis of the sensor data to show its auxiliary effect for different behaviors. We report egocentric activity recognition results on a base network architecture when the three modalities (RGB, acceleration, and gyroscope) are used separately and jointly. To explore catastrophic forgetting in continual learning tasks, we extensively evaluate four baseline methods with different multi-modal combinations. We hope that UESTC-MMEA-CL will promote future studies on continual learning for first-person activity recognition in wearable applications.