Accurate prediction of the perceptual attributes of haptic textures is essential for advancing virtual reality (VR) and augmented reality (AR) applications and for enhancing robotic interaction with physical surfaces. This paper presents a deep learning-based multi-modal framework that leverages multi-feature visual and tactile inputs to predict perceptual texture ratings. To this end, a four-dimensional haptic attribute space spanning the rough-smooth, flat-bumpy, sticky-slippery, and hard-soft dimensions is first constructed through psychophysical experiments in which participants evaluate 50 diverse real-world texture samples. A physical signal space is then created by collecting visual and tactile data from these textures. Finally, a deep learning architecture that integrates a convolutional neural network (CNN)-based autoencoder for visual feature learning with a convolutional long short-term memory (ConvLSTM) network for tactile data processing is trained to predict the user-assigned attribute ratings. This multi-modal, multi-feature approach maps physical signals to perceptual ratings, enabling accurate predictions for unseen textures. To assess the model's accuracy, reliability, and generalizability, we employ leave-one-out cross-validation and compare the framework against several machine learning and deep learning baselines. Experimental results show that the framework consistently outperforms single-modality approaches, achieving lower mean absolute error (MAE) and root mean square error (RMSE), which highlights the efficacy of combining visual and tactile modalities.
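To make the fusion idea concrete, the sketch below shows one minimal way such a multi-modal regressor could be wired up in PyTorch. It is an illustration under stated assumptions, not the authors' implementation: a plain LSTM stands in for the ConvLSTM tactile branch, the CNN encoder stands in for the encoder half of the visual autoencoder, and all module names, tensor shapes, and hyperparameters are hypothetical.

```python
# Illustrative sketch only: fuse a CNN visual encoder with a recurrent tactile
# encoder and regress the four perceptual attribute ratings. All shapes and
# hyperparameters are assumptions; a plain LSTM replaces the paper's ConvLSTM.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """CNN encoder standing in for the encoder half of the visual autoencoder."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):  # x: (B, 3, H, W) texture image
        return self.fc(self.conv(x).flatten(1))

class TactileEncoder(nn.Module):
    """Recurrent encoder over a tactile time series (simplified ConvLSTM stand-in)."""
    def __init__(self, in_dim=6, feat_dim=128):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, feat_dim, batch_first=True)

    def forward(self, x):  # x: (B, T, in_dim) e.g. acceleration/force samples
        _, (h, _) = self.rnn(x)
        return h[-1]       # last hidden state summarizes the sequence

class MultiModalRater(nn.Module):
    """Concatenate visual and tactile features and regress 4 attribute ratings."""
    def __init__(self, feat_dim=128, n_attributes=4):
        super().__init__()
        self.visual = VisualEncoder(feat_dim)
        self.tactile = TactileEncoder(feat_dim=feat_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, n_attributes),  # rough-smooth, flat-bumpy, sticky-slippery, hard-soft
        )

    def forward(self, image, tactile_seq):
        z = torch.cat([self.visual(image), self.tactile(tactile_seq)], dim=1)
        return self.head(z)

# Toy forward pass with random tensors in place of real texture recordings.
model = MultiModalRater()
ratings = model(torch.randn(2, 3, 128, 128), torch.randn(2, 200, 6))
print(ratings.shape)  # torch.Size([2, 4])
```

In a leave-one-out evaluation of the kind described above, a model like this would be retrained once per texture sample, with MAE and RMSE computed between the predicted and user-assigned ratings of the held-out texture.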