Accurate perception of object hardness is essential for safe and dexterous contact-rich robotic manipulation. Here, we present TactEx, an explainable multimodal robotic interaction framework that unifies vision, touch, and language for human-like hardness estimation and interactive guidance. We evaluate TactEx on fruit-ripeness assessment, a representative task that requires both tactile sensing and contextual understanding. The system fuses GelSight-Mini tactile streams with RGB observations and language prompts. A ResNet50+LSTM model estimates hardness from sequential tactile data, while a cross-modal alignment module combines visual cues with guidance from a large language model (LLM). The resulting hardness estimates separate ripeness levels with statistical significance (p < 0.01 for all fruit pairs), letting users distinguish them through the explainable multimodal interface. For touch placement, we compare YOLO with Grounded-SAM (GSAM) and find GSAM more robust for fine-grained segmentation and contact-site selection. A lightweight LLM parses user instructions and produces grounded natural-language explanations linked to the tactile outputs. In end-to-end evaluations, TactEx attains 90% task success on simple user queries and generalises to novel tasks without large-scale tuning. These results highlight the promise of combining pretrained visual and tactile models with language grounding to advance explainable, human-like touch perception and decision-making in robotics.
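To make the tactile pipeline concrete, the following is a minimal PyTorch sketch of a ResNet50+LSTM hardness estimator of the kind the abstract describes: a ResNet50 backbone encodes each GelSight frame, an LSTM aggregates the per-frame features over time, and a linear head regresses a scalar hardness value. The class name, feature dimensions, sequence length, and regression head are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TactileHardnessNet(nn.Module):
    """ResNet50 per-frame encoder + LSTM over the tactile sequence.

    Hypothetical sketch of the ResNet50+LSTM estimator mentioned in the
    abstract; layer sizes and the scalar regression head are assumptions.
    """
    def __init__(self, feat_dim=2048, hidden_dim=256):
        super().__init__()
        backbone = resnet50(weights=None)  # pretrained weights could be loaded here
        # Drop the classification head; keep the 2048-d pooled features.
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)  # scalar hardness estimate

    def forward(self, frames):  # frames: (B, T, 3, H, W) GelSight images
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1))  # (B*T, 2048, 1, 1)
        feats = feats.flatten(1).view(b, t, -1)     # (B, T, 2048)
        _, (h_n, _) = self.lstm(feats)              # final hidden state
        return self.head(h_n[-1]).squeeze(-1)       # (B,) hardness

# Example: a batch of 2 sequences of 8 tactile frames at 224x224.
model = TactileHardnessNet()
x = torch.randn(2, 8, 3, 224, 224)
print(model(x).shape)  # torch.Size([2])
```

In practice the backbone would typically be initialised from ImageNet-pretrained weights and fine-tuned on tactile data; the sketch omits training code.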
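For touch placement, the abstract reports that Grounded-SAM segmentation is the more robust source of contact sites. Below is a small, self-contained sketch of one plausible way to pick a contact point from a binary segmentation mask; the distance-transform heuristic is an assumption for illustration, not the paper's stated selection rule.

```python
import cv2
import numpy as np

def select_contact_site(mask: np.ndarray) -> tuple[int, int]:
    """Pick a touch point well inside a segmentation mask.

    `mask` is a binary (H, W) mask such as one produced by Grounded-SAM.
    The distance-transform heuristic below is an illustrative assumption:
    it returns the pixel farthest from the object boundary, a safe
    interior point for placing the tactile sensor.
    """
    dist = cv2.distanceTransform(mask.astype(np.uint8), cv2.DIST_L2, 5)
    y, x = np.unravel_index(np.argmax(dist), dist.shape)
    return int(x), int(y)

# Example: a synthetic circular "fruit" mask.
mask = np.zeros((240, 320), np.uint8)
cv2.circle(mask, (160, 120), 60, 1, -1)
print(select_contact_site(mask))  # ~ (160, 120), the circle centre
```

The distance transform favours points deep inside the object, which keeps the sensor away from mask boundaries, where segmentation, and hence contact placement, is least reliable.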