Oral cancer is frequently diagnosed at late stages because its lesions closely resemble other oral lesions. Existing research on computer-aided diagnosis has made progress using deep learning; however, most approaches remain limited by small, imbalanced datasets and a dependence on single-modality features, which restricts model generalization in real-world clinical settings. To address these limitations, this study proposes a novel data-augmentation-driven multimodal feature-fusion framework integrated within a Vision Recognition (VR)-assisted oral cancer recognition system. Our method combines extensive data-centric augmentation with fused clinical and image-based representations to enhance model robustness and reduce diagnostic ambiguity. Using a stratified training pipeline and an EfficientNetV2-B1 backbone, the system improves feature diversity, mitigates class imbalance, and strengthens the learned multimodal embeddings. Experimental evaluation demonstrates that the proposed framework achieves an overall accuracy of 82.57% on the 2-class task, 65.13% on the 3-class task, and 54.97% on the 4-class task, outperforming traditional single-stream CNN models. These results highlight the effectiveness of multimodal feature fusion combined with strategic augmentation for reliable early oral cancer lesion recognition, and they lay the foundation for immersive VR-based clinical decision support tools.
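To make the described architecture concrete, the sketch below illustrates one plausible realization of the pipeline: on-the-fly augmentation feeding an EfficientNetV2-B1 image branch, fused by concatenation with a small MLP over tabular clinical features. This is not the authors' released code; the clinical-feature dimension, layer widths, and augmentation settings are illustrative assumptions.

```python
# Minimal sketch of an augmentation-driven multimodal fusion model.
# Assumptions (not from the paper): CLINICAL_DIM, layer sizes, augmentation ranges.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 2    # 2-, 3-, or 4-class lesion recognition
CLINICAL_DIM = 16  # assumed number of tabular clinical descriptors

# Data-centric augmentation applied during training to increase feature diversity.
augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),
    layers.RandomRotation(0.1),
    layers.RandomZoom(0.1),
    layers.RandomContrast(0.2),
])

# Image branch: EfficientNetV2-B1 backbone (ImageNet weights, global average pooling).
image_in = layers.Input(shape=(240, 240, 3), name="image")
x = augment(image_in)
backbone = tf.keras.applications.EfficientNetV2B1(
    include_top=False, weights="imagenet", pooling="avg")
x = backbone(x)

# Clinical branch: a small MLP over the tabular clinical features.
clinical_in = layers.Input(shape=(CLINICAL_DIM,), name="clinical")
c = layers.Dense(64, activation="relu")(clinical_in)

# Feature fusion: concatenate the image and clinical embeddings before classification.
fused = layers.Concatenate()([x, c])
fused = layers.Dropout(0.3)(fused)
out = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

model = models.Model(inputs=[image_in, clinical_in], outputs=out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

For the stratified training pipeline, a standard approach consistent with the abstract would be a label-stratified split, e.g. `sklearn.model_selection.train_test_split(images, labels, stratify=labels)`, so that each class keeps its proportion across train and validation sets despite the dataset imbalance.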