This paper presents a novel deep learning framework for robotic arm manipulation that integrates multimodal inputs using a late-fusion strategy. Unlike traditional end-to-end or reinforcement learning approaches, our method processes image sequences with pre-trained models and robot state data with machine learning algorithms, fusing their outputs to predict continuous action values for control. Evaluated on BridgeData V2 and Kuka datasets, the best configuration (VGG16 + Random Forest) achieved MSEs of 0.0021 and 0.0028, respectively, demonstrating strong predictive performance and robustness. The framework supports modularity, interpretability, and real-time decision-making, aligning with the goals of adaptive, human-in-the-loop cyber-physical systems.
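As a rough illustration of how such a late-fusion pipeline could be wired together, the sketch below pairs a frozen, pre-trained VGG16 backbone (used only as an image feature extractor) with a scikit-learn RandomForestRegressor that maps the fused image-feature/robot-state vector to continuous actions. The exact fusion step, feature dimensions, dataset loading, and train/test split are assumptions made for illustration; this is not the paper's implementation.

```python
# Minimal late-fusion sketch: frozen VGG16 features + robot state -> Random Forest -> actions.
# Shapes, the synthetic data, and the fusion-by-concatenation choice are assumptions.
import numpy as np
import torch
import torchvision.models as models
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# 1. Visual branch: pre-trained VGG16 with the final 1000-way layer removed (4096-d features).
vgg16 = models.vgg16(weights="IMAGENET1K_V1")
vgg16.classifier = vgg16.classifier[:-1]
vgg16.eval()

def extract_image_features(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, 224, 224) normalized RGB frames -> (N, 4096) feature matrix."""
    with torch.no_grad():
        return vgg16(images).cpu().numpy()

# 2. Synthetic stand-ins for a manipulation dataset (camera frames, robot state, action targets).
N = 64
frames = torch.randn(N, 3, 224, 224)      # image observations
robot_state = np.random.randn(N, 7)       # e.g. joint angles / end-effector pose
actions = np.random.randn(N, 7)           # continuous action values to predict

# 3. Late fusion: concatenate the two branches' outputs, then regress actions with a Random Forest.
X = np.concatenate([extract_image_features(frames), robot_state], axis=1)
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X[:48], actions[:48])              # illustrative train split
pred = rf.predict(X[48:])                 # held-out split

print("MSE:", mean_squared_error(actions[48:], pred))
```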