Vision Language Models (VLMs) have demonstrated remarkable potential in multimodal reasoning, yet they inherently suffer from spatial blindness and logical hallucinations when interpreting densely structured engineering content, such as analog circuit schematics. To address these challenges, we propose a Vision Language Model-Optimized Collaborative Agent Design Workflow for Analog Circuit Sizing (VLM-CAD) designed for robust, step-by-step reasoning over multimodal evidence. VLM-CAD bridges the modality gap by integrating a neuro-symbolic structural parsing module, Image2Net, which transforms raw pixels into explicit topological graphs and structured JSON representations to anchor VLM interpretation in deterministic facts. To ensure the reliability required for engineering decisions, we further propose ExTuRBO, an Explainable Trust Region Bayesian Optimization method. ExTuRBO serves as an explainable grounding engine, employing agent-generated semantic seeds to warm-start local searches and utilizing Automatic Relevance Determination to provide quantified evidence for the VLM's decisions. Experimental results on two complex circuit benchmarks demonstrate that VLM-CAD significantly enhances spatial reasoning accuracy and maintains physics-based explainability. VLM-CAD consistently satisfies complex specification requirements while achieving low power consumption, with a total runtime under 66 minutes, marking a significant step toward robust, explainable multimodal reasoning in specialized technical domains.
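To make the idea of anchoring interpretation in "explicit topological graphs and structured JSON representations" concrete, the following is a minimal sketch of how a parsed schematic might be turned into a net-level connectivity graph. The JSON schema, the device names, and the helpers `build_net_graph` and `neighbors` are hypothetical illustrations; the abstract does not specify Image2Net's actual output format.

```python
import json
from collections import defaultdict

# Hypothetical schema (an assumption, not Image2Net's real format):
# each device lists its pins and the net each pin is wired to.
NETLIST_JSON = """
{
  "devices": [
    {"name": "M1", "type": "nmos", "pins": {"d": "out", "g": "in", "s": "tail"}},
    {"name": "M2", "type": "nmos", "pins": {"d": "vdd", "g": "ref", "s": "tail"}},
    {"name": "M3", "type": "nmos", "pins": {"d": "tail", "g": "bias", "s": "gnd"}}
  ]
}
"""


def build_net_graph(netlist):
    """Map each net name to the sorted (device, pin) pairs attached to it."""
    nets = defaultdict(list)
    for dev in netlist["devices"]:
        for pin, net in dev["pins"].items():
            nets[net].append((dev["name"], pin))
    return {net: sorted(conns) for net, conns in nets.items()}


def neighbors(graph, device):
    """Return the devices that share at least one net with `device`."""
    shared = set()
    for conns in graph.values():
        names = {name for name, _ in conns}
        if device in names:
            shared |= names - {device}
    return sorted(shared)


graph = build_net_graph(json.loads(NETLIST_JSON))
print(graph["tail"])           # connections on the shared "tail" net
print(neighbors(graph, "M1"))  # devices electrically adjacent to M1
```

A representation of this kind lets a model answer connectivity questions by lookup rather than by reading wire crossings from pixels, which is the sort of deterministic grounding the abstract describes.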
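The two mechanisms the abstract attributes to ExTuRBO, warm-starting a local search from agent-generated seeds and using Automatic Relevance Determination (ARD) as quantified evidence, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes that relevance is read off as normalized inverse GP lengthscales, and that the trust region is rescaled per dimension by lengthscale in the way the original TuRBO method describes. The functions `pick_center`, `ard_relevance`, and `trust_region`, along with all numeric values, are hypothetical.

```python
import math


def pick_center(seeds, costs):
    """Warm start: use the lowest-cost seed as the trust-region center."""
    best = min(range(len(seeds)), key=lambda i: costs[i])
    return seeds[best]


def ard_relevance(lengthscales):
    """Normalized inverse lengthscales (assumed relevance measure).

    A short lengthscale means the objective varies quickly along that
    dimension, so the parameter is treated as more relevant.
    """
    inv = [1.0 / ls for ls in lengthscales]
    total = sum(inv)
    return [v / total for v in inv]


def trust_region(center, lengthscales, base_length, lower, upper):
    """Axis-aligned box around `center`, clipped to the search bounds.

    Side lengths are scaled by lengthscale / geometric-mean lengthscale,
    so the box is wider along dimensions the model considers less sensitive.
    """
    d = len(center)
    geo_mean = math.exp(sum(math.log(ls) for ls in lengthscales) / d)
    box = []
    for c, ls, lo, hi in zip(center, lengthscales, lower, upper):
        half = 0.5 * base_length * ls / geo_mean
        box.append((max(lo, c - half), min(hi, c + half)))
    return box


# Hypothetical normalized sizing parameters, e.g. (W1, L1, Ibias) in [0, 1].
seeds = [[0.2, 0.8, 0.5], [0.5, 0.5, 0.5], [0.9, 0.1, 0.3]]
costs = [3.1, 1.4, 2.7]          # made-up simulated costs per seed
lengthscales = [0.5, 2.0, 1.0]   # made-up fitted ARD lengthscales

center = pick_center(seeds, costs)
relevance = ard_relevance(lengthscales)
box = trust_region(center, lengthscales, 0.4, [0.0] * 3, [1.0] * 3)
print(center, relevance, box)
```

In this toy setup the first parameter has the shortest lengthscale, so it receives the highest relevance score and the narrowest search interval. A per-parameter score of this kind is one plausible form for the "quantified evidence" handed back to the VLM.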