Model-Driven Engineering (MDE) places models at the core of system and data engineering processes. In the context of research data, these models are typically expressed as schemas that define the structure and semantics of datasets. However, many domains still lack standardized models, and creating them remains a significant barrier, especially for non-experts. We present a hybrid approach that combines large language models (LLMs) with deterministic techniques to support JSON Schema creation, JSON Schema modification, and schema mapping from natural-language user input. These capabilities are integrated into the open-source tool MetaConfigurator, which already provides visual model editing, validation, code generation, and form generation from models. For data integration, we use LLMs to generate schema mappings from heterogeneous JSON, CSV, XML, and YAML data, while ensuring scalability and reliability by executing the generated mapping rules deterministically. We demonstrate the applicability of our work with an application example from the field of chemistry. By combining natural language interaction with deterministic safeguards, this work lowers the barrier to structured data modeling and data integration for non-experts.
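To illustrate the idea of executing generated mapping rules deterministically, the following is a minimal sketch in Python. It is not MetaConfigurator's actual implementation or rule format: the rule structure (`source`, `target`, `transform`), the helper names, and the sample chemistry record are all assumptions made for illustration. The point it shows is that an LLM would only need to produce a small set of declarative rules once, after which a plain interpreter can apply them to any number of records without further model calls.

```python
from typing import Any, Callable

# Hypothetical whitelist of transforms a rule may name. Restricting rules to
# known functions keeps execution predictable regardless of what was generated.
TRANSFORMS: dict[str, Callable[[Any], Any]] = {
    "to_float": float,
    "strip": str.strip,
    "upper": str.upper,
}


def get_path(doc: dict, path: str) -> Any:
    """Read a value from nested dicts using a dot-separated path."""
    current: Any = doc
    for key in path.split("."):
        current = current[key]
    return current


def set_path(doc: dict, path: str, value: Any) -> None:
    """Write a value into nested dicts, creating intermediate objects."""
    keys = path.split(".")
    current = doc
    for key in keys[:-1]:
        current = current.setdefault(key, {})
    current[keys[-1]] = value


def apply_mapping(record: dict, rules: list[dict]) -> dict:
    """Apply every rule in order; the same input always yields the same output."""
    result: dict = {}
    for rule in rules:
        value = get_path(record, rule["source"])
        name = rule.get("transform")
        if name is not None:
            value = TRANSFORMS[name](value)  # unknown names raise KeyError
        set_path(result, rule["target"], value)
    return result


# Hypothetical rules, standing in for what an LLM might propose for a
# source record and a target schema with a nested "compound" object.
rules = [
    {"source": "name", "target": "compound.label", "transform": "strip"},
    {"source": "mw", "target": "compound.molarMass", "transform": "to_float"},
]

record = {"name": " ethanol ", "mw": "46.07"}
mapped = apply_mapping(record, rules)
print(mapped)
```

Under these assumptions, the rules can be reviewed by a person before use, and a rule that names an unknown transform or a missing source field fails with an explicit error instead of producing silently wrong output.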