A Generalizable Framework for Building Executable Domain-Specific LLMs under Data Scarcity: Demonstration on Semiconductor TCAD Simulation

Scientific and engineering verticals often suffer from data scarcity and strict executability requirements: models must generate not only fluent text, but also syntactically valid, tool-compilable scripts. We present a schema-first alignment framework for building compact, executable domain-specific LLMs in low-resource settings. The framework integrates three core components: (i) large-scale synthetic QA data generation from expert documentation to instill foundational domain knowledge; (ii) a code-centric IR->DPO workflow that converts verified tool decks into interpretable intermediate representations (IR), performs equivalence-preserving diversification, and constructs preference pairs to directly optimize instruction compliance and code executability; and (iii) a controlled evaluation of Retrieval-Augmented Generation (RAG), showing that while RAG benefits general LLMs, it can marginally degrade the performance of already domain-aligned models. We demonstrate the framework by instantiating TcadGPT for semiconductor Technology Computer-Aided Design (TCAD). Using 1.5M synthetic QA pairs and an IR-driven DPO dataset, TcadGPT attains 85.6% semantic accuracy and an 80.0% syntax pass rate on SDE executability tests, substantially outperforming state-of-the-art general LLMs such as GPT-4o. To probe portability beyond TCAD, we apply the same recipe to the open-source FEM solver Elmer, observing consistent improvements in script-level success rates over general-purpose baselines. All datasets, benchmarks, and code (including P1, P2, and IR->DPO) are released for reproducibility. Together, these results suggest that the proposed framework provides a robust and reproducible path toward executable LLMs in specialized, data-scarce professional domains.
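
To make the IR->DPO idea in component (ii) concrete, the following is a minimal toy sketch, not the released implementation. Everything in it is an assumption made for illustration: the dictionary-based IR, the made-up S-expression deck syntax and command names, the keyword-shuffling diversification, and the `prompt`/`chosen`/`rejected` record layout (a common convention for preference datasets). A real pipeline would check executability with the actual tool rather than with a toy parser.

```python
import random

# Hypothetical IR: one dict per command. Op names and syntax are invented here
# and do not correspond to any real TCAD tool.
IR = [
    {"op": "define-region", "args": {"name": "substrate", "material": "Silicon"}},
    {"op": "define-region", "args": {"name": "oxide", "material": "Oxide"}},
]

def render(ir, rng=None):
    """Render IR to toy deck text; if rng is given, shuffle keyword order."""
    lines = []
    for cmd in ir:
        items = list(cmd["args"].items())
        if rng is not None:
            rng.shuffle(items)  # keyword order carries no meaning in this toy syntax
        kv = " ".join(f":{k} {v}" for k, v in items)
        lines.append(f"({cmd['op']} {kv})")
    return "\n".join(lines)

def parse(text):
    """Parse toy deck text back into IR; raise ValueError if malformed."""
    ir = []
    for line in text.splitlines():
        if not (line.startswith("(") and line.endswith(")")):
            raise ValueError(f"unbalanced command: {line!r}")
        tokens = line[1:-1].split()
        if not tokens or len(tokens) % 2 != 1:
            raise ValueError(f"bad command: {line!r}")
        args = {}
        for key, value in zip(tokens[1::2], tokens[2::2]):
            if not key.startswith(":"):
                raise ValueError(f"bad keyword: {key!r}")
            args[key[1:]] = value
        ir.append({"op": tokens[0], "args": args})
    return ir

def is_valid(text):
    """Stand-in for a tool syntax check: True if the toy parser accepts text."""
    try:
        parse(text)
        return True
    except ValueError:
        return False

def make_pair(prompt, ir, seed):
    """Build one preference record from a verified IR."""
    rng = random.Random(seed)
    chosen = render(ir, rng)      # diversified surface form
    if parse(chosen) != ir:       # equivalence check against the source IR
        raise RuntimeError("diversification changed the meaning")
    rejected = chosen[:-1]        # drop the final ')' to break the syntax
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = make_pair("Define a silicon substrate and an oxide region.", IR, seed=0)
```

The design point the sketch tries to capture is that the IR, not the surface text, is the unit of truth: each diversified deck is accepted as a preferred sample only if it parses back to the same IR, while the dispreferred sample is a controlled corruption that fails the validity check.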
