This research introduces a reproducible framework for transforming raw, heterogeneous sensor streams into aligned, semantically meaningful representations for multimodal human activity recognition. Grounded in the Carnegie Mellon University Multi-Modal Activity Database (CMU-MMAC) and focused on the naturalistic Subject 07 Brownie session, the study traces the full pipeline from data ingestion to modeling and interpretation. In contrast to black-box preprocessing, it proposes a unified workflow that temporally aligns the video, audio, and RFID streams through resampling, grayscale conversion, sliding-window segmentation, and modality-specific normalization, producing standardized fused tensors suitable for downstream learning. Building on this foundation, the work systematically compares early, late, and hybrid fusion strategies using LSTM-based models implemented in PyTorch and TensorFlow. Late fusion consistently achieves the highest validation accuracy, and hybrid fusion outperforms early fusion. To assess interpretability and modality contribution, the study uses PCA and t-SNE visualizations, which reveal coherent temporal structure and confirm that video carries stronger discriminative power than audio, while combining the two yields substantial performance gains. Incorporating sparse, asynchronous RFID signals further improves accuracy by over 50% and raises macro-averaged ROC-AUC, demonstrating the added value of object-interaction cues. Overall, the framework contributes a modular, empirically validated approach to multimodal fusion that links preprocessing design, fusion architecture, and interpretability, offering a transferable template for intelligent systems operating in complex, real-world activity settings.
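To make the preprocessing and fusion terminology concrete, the following is a minimal sketch and not the study's implementation. It assumes the three streams have already been resampled to a common rate, uses synthetic arrays in place of CMU-MMAC data, and omits the LSTM models entirely. The window size, step, feature dimensions, and all function names are hypothetical choices made for illustration.

```python
import numpy as np

def sliding_windows(x, size, step):
    """Cut a (time, features) array into (n_windows, size, features) windows."""
    starts = range(0, x.shape[0] - size + 1, step)
    return np.stack([x[s:s + size] for s in starts])

def zscore(x, eps=1e-8):
    """Normalize each feature column with statistics from this modality only."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def early_fusion(windows_by_modality):
    """Early fusion: concatenate features before any model sees them."""
    return np.concatenate(windows_by_modality, axis=-1)

def late_fusion(probs_by_modality):
    """Late fusion: average class probabilities from per-modality models."""
    return np.mean(probs_by_modality, axis=0)

# Synthetic stand-ins for aligned streams (assumption: 100 shared time steps).
rng = np.random.default_rng(0)
video = rng.normal(5.0, 2.0, size=(100, 16))        # e.g. grayscale frame features
audio = rng.normal(0.0, 0.1, size=(100, 8))
rfid = (rng.random((100, 4)) < 0.05).astype(float)  # sparse binary tag events

# Modality-specific normalization: z-score the dense streams and leave the
# binary RFID events untouched (an illustrative choice, not the paper's rule).
windows = [
    sliding_windows(zscore(video), size=20, step=10),
    sliding_windows(zscore(audio), size=20, step=10),
    sliding_windows(rfid, size=20, step=10),
]
fused = early_fusion(windows)  # shape: (windows, time steps, total features)
```

In this sketch, early fusion produces a single tensor that one sequence model would consume, whereas `late_fusion` would be applied to the class probabilities produced by separate per-modality models. A hybrid scheme would combine the two, for example by fusing intermediate features from each modality before a shared classifier.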