Humans can intuitively infer the actions that took place between two observed states through deductive reasoning. This is because the brain operates on a bidirectional communication model, which radically improves the accuracy of recognition and prediction based on features connected to previous experiences. Over the past decade, deep learning models for action recognition have improved significantly. However, deep neural networks still struggle when only a small dataset is available for a specific action recognition (AR) task. As with most AR tasks, the ambiguity of accurately describing activities in spatio-temporal data is a drawback, one that can be mitigated by curating suitable datasets with careful annotation and preprocessing of the video data. In this study, we present a novel lightweight framework that combines transfer learning with a Conv2D LSTM layer: features are extracted with an I3D model pre-trained on the Kinetics dataset and applied to a new AR task (Smart Baby Care) that requires a smaller dataset and fewer computational resources. Furthermore, we developed a benchmark dataset and an automated model that uses LSTM convolution with I3D (ConvLSTM-I3D) to recognize and predict baby activities in a smart baby room. Finally, we implemented video augmentation to improve model performance on the Smart Baby Care task. Compared to other benchmark models, our experimental framework achieved better performance with fewer computational resources.
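The abstract mentions video augmentation without naming the transforms used. As a minimal sketch only, the snippet below shows one common way to augment a clip so that every frame receives the same transform: a random contiguous temporal crop followed by an optional horizontal flip. The function name `augment_clip`, its parameters, and the choice of these two transforms are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np


def augment_clip(clip, out_frames, rng, flip_prob=0.5):
    """Return a temporally cropped, optionally mirrored copy of one clip.

    clip: array of shape (frames, height, width, channels).
    NOTE: hypothetical helper; the transforms are assumed, not from the paper.
    """
    total = clip.shape[0]
    if out_frames > total:
        raise ValueError("clip is shorter than the requested window")
    # One random start index; the window is contiguous, so frame order is kept.
    start = int(rng.integers(0, total - out_frames + 1))
    window = clip[start:start + out_frames]
    # One flip decision per clip, so all frames are mirrored consistently.
    if rng.random() < flip_prob:
        window = window[:, :, ::-1, :]
    return np.ascontiguousarray(window)


rng = np.random.default_rng(0)
# Synthetic stand-in for a decoded video: 32 frames of 4x6 RGB pixels.
video = np.arange(32 * 4 * 6 * 3, dtype=np.float32).reshape(32, 4, 6, 3)
sample = augment_clip(video, out_frames=16, rng=rng)
print(sample.shape)  # (16, 4, 6, 3)
```

Drawing the crop offset and the flip decision once per clip, rather than once per frame, keeps the motion in the clip coherent, which matters when a temporal model such as a ConvLSTM consumes the frames in order. Whether mirroring preserves the label depends on the activity classes, so it would need to be checked against the dataset's annotations.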