Traditional Automatic License Plate Recognition (ALPR) systems employ multi-stage pipelines in which an object detection network is followed by a separate Optical Character Recognition (OCR) module, a design that compounds errors, increases latency, and adds architectural complexity. This research presents Neural Sentinel, a unified approach that uses Vision Language Models (VLMs) to perform license plate recognition, state classification, and vehicle attribute extraction in a single forward pass. Our primary contribution is to demonstrate that a fine-tuned PaliGemma 3B model, adapted via Low-Rank Adaptation (LoRA), can simultaneously answer multiple visual questions about vehicle images, achieving 92.3% plate recognition accuracy, an improvement of 14.1% over the EasyOCR baseline and 9.9% over the PaddleOCR baseline. We introduce a Human-in-the-Loop (HITL) continual learning framework that incorporates user corrections while preventing catastrophic forgetting through experience replay, maintaining a 70:30 ratio of original training data to correction samples. The system achieves a mean inference latency of 152 ms with an Expected Calibration Error (ECE) of 0.048, indicating well-calibrated confidence estimates. Additionally, the VLM-first architecture enables zero-shot generalization to auxiliary tasks, including vehicle color detection (89%), seatbelt detection (82%), and occupancy counting (78%), without task-specific training. Through extensive experimentation on real-world toll plaza imagery, we demonstrate that unified vision-language approaches represent a paradigm shift in ALPR systems, offering superior accuracy, reduced architectural complexity, and emergent multi-task capabilities that traditional pipeline approaches cannot achieve.
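Two quantities in the abstract can be made concrete with a short sketch: the 70:30 experience-replay mix and the Expected Calibration Error. The code below is illustrative only and is not the Neural Sentinel implementation. The function names `build_replay_batch` and `expected_calibration_error` are hypothetical, the 70:30 ratio is assumed to apply per training batch with sampling with replacement, and the ECE uses 10 equal-width confidence bins, a common convention that the abstract does not itself specify.

```python
import random


def build_replay_batch(original, corrections, batch_size,
                       original_fraction=0.7, rng=None):
    """Mix original training samples with user corrections.

    Assumption: the 70:30 ratio is enforced per batch. Sampling is done
    with replacement so that a small pool of corrections can still fill
    its share of the batch.
    """
    rng = rng or random.Random()
    n_original = round(batch_size * original_fraction)
    n_corrections = batch_size - n_original
    batch = (rng.choices(original, k=n_original)
             + rng.choices(corrections, k=n_corrections))
    rng.shuffle(batch)  # avoid a fixed original-then-correction ordering
    return batch


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / n) * |accuracy - mean confidence|.

    `confidences` are predicted probabilities in [0, 1]; `correct` are
    booleans saying whether each prediction was right.
    """
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece
```

Under these assumptions, a batch of 10 contains 7 original samples and 3 corrections, and a model that reports 0.9 confidence while being right 90% of the time has an ECE close to zero. An ECE of 0.048, as reported above, would then mean that confidence and accuracy differ by under five percentage points on average across bins.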