VLMaterial: Vision-Language Model-Based Camera-Radar Fusion for Physics-Grounded Material Identification

Accurate material recognition is a fundamental capability for intelligent perception systems that must interact safely and effectively with the physical world. For instance, distinguishing visually similar objects such as glass and plastic cups is critical for safety, yet it is challenging for vision-based methods because of specular reflections, transparency, and visual deception. Millimeter-wave (mmWave) radar offers material sensing that is robust to lighting conditions, but existing camera-radar fusion methods are limited to closed-set categories and lack semantic interpretability. In this paper, we introduce VLMaterial, a training-free framework that fuses vision-language models (VLMs) with domain-specific radar knowledge for physics-grounded material identification. First, we propose a dual-pipeline architecture: an optical pipeline uses the Segment Anything Model (SAM) and a VLM to propose material candidates, while an electromagnetic characterization pipeline extracts the intrinsic dielectric constant from radar signals via an effective peak reflection cell area (PRCA) method and weighted vector synthesis. Second, we employ a context-augmented generation (CAG) strategy to equip the VLM with radar-specific physical knowledge, enabling it to interpret electromagnetic parameters as stable references. Third, we introduce an adaptive fusion mechanism that integrates the outputs of both sensors and resolves cross-modal conflicts based on uncertainty estimation. We evaluate VLMaterial in over 120 real-world experiments involving 41 diverse everyday objects and 4 typical visually deceptive counterfeits across varying environments. Experimental results show that VLMaterial achieves a recognition accuracy of 96.08%, on par with state-of-the-art closed-set benchmarks, while eliminating the need for extensive task-specific data collection and training.
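To make the idea of radar-grounded reranking concrete, the following is a minimal sketch and not the VLMaterial implementation. It assumes a simplified physical model (normal-incidence Fresnel reflection at an air/lossless-dielectric boundary) in place of the PRCA method, and a Gaussian agreement score in place of the paper's adaptive fusion. The function names, the reference permittivity ranges, and the `eps_sigma` uncertainty parameter are all hypothetical placeholders chosen for illustration.

```python
import math

# Illustrative relative-permittivity ranges. These are assumed placeholder
# values for the sketch, not measurements or references from the paper.
REFERENCE_PERMITTIVITY = {
    "glass": (4.0, 7.0),
    "plastic": (2.0, 3.5),
}


def permittivity_from_reflection(gamma_mag: float) -> float:
    """Invert the normal-incidence Fresnel reflection magnitude.

    For an air/lossless-dielectric boundary,
    |Gamma| = (sqrt(eps_r) - 1) / (sqrt(eps_r) + 1),
    so sqrt(eps_r) = (1 + |Gamma|) / (1 - |Gamma|).
    """
    if not 0.0 <= gamma_mag < 1.0:
        raise ValueError("reflection magnitude must be in [0, 1)")
    root = (1.0 + gamma_mag) / (1.0 - gamma_mag)
    return root * root


def fuse(candidates: dict, eps_estimate: float, eps_sigma: float) -> dict:
    """Rerank visual candidates by agreement with the radar estimate.

    candidates maps a material name to a visual confidence in [0, 1].
    Every candidate is assumed to have an entry in REFERENCE_PERMITTIVITY.
    """
    scores = {}
    for material, visual_conf in candidates.items():
        low, high = REFERENCE_PERMITTIVITY[material]
        # Distance from the estimate to the reference range (0 if inside it).
        gap = max(low - eps_estimate, 0.0, eps_estimate - high)
        radar_agreement = math.exp(-0.5 * (gap / eps_sigma) ** 2)
        scores[material] = visual_conf * radar_agreement
    total = sum(scores.values())
    if total == 0.0:
        # Radar contradicts every candidate; fall back to the visual scores.
        return dict(candidates)
    return {m: s / total for m, s in scores.items()}
```

In this sketch, a larger `eps_sigma` means the radar estimate is trusted less, so the visual confidences dominate; a smaller value lets the radar estimate override a visually plausible but physically inconsistent candidate, as in the glass-versus-plastic cup case.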
