ETL (Extract, Transform, Load) tools such as IBM DataStage let users assemble complex data workflows visually, but configuring stages and their properties remains time-consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and the detailed configuration of the flow. At its core is a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. The predicted stages are then connected into non-linear workflows through edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines and show that it improves accuracy and efficiency while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system evaluated in detail across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.
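To make the pipeline shape concrete, the following is a minimal Python sketch of the four steps the abstract names: utterance decomposition, stage classification, edge prediction, and property inference. It is an illustration under stated assumptions, not the system's implementation. Every name here (`Stage`, `decompose`, `classify`, `predict_edges`, `infer_properties`, `build_workflow`, the stage types, and the keyword and regex tables) is hypothetical. A keyword lookup stands in for the trained classifier and the stage-specific few-shot prompting, and simple heuristics stand in for the edge and property predictors.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field

# Hypothetical keyword table; a stand-in for the classifier plus few-shot prompting.
STAGE_KEYWORDS = {
    "source": ("read", "load from", "extract"),
    "filter": ("filter", "keep only", "where"),
    "join": ("join", "merge"),
    "target": ("write", "save", "store"),
}

# Hypothetical (property name, regex) per stage type; a stand-in for property inference.
PROPERTY_PATTERNS = {
    "source": ("file", r"[\w-]+\.\w+"),
    "target": ("file", r"[\w-]+\.\w+"),
    "filter": ("condition", r"\bwhere\s+(.+)"),
    "join": ("key", r"\bon\s+(\w+)"),
}


@dataclass
class Stage:
    stage_id: int
    stage_type: str
    sub_utterance: str
    properties: dict = field(default_factory=dict)


def decompose(utterance: str) -> list[str]:
    """Split an utterance into sub-utterances on ';' and on '(,) then'."""
    parts = re.split(r"\s*,?\s*\bthen\b\s*|\s*;\s*", utterance)
    return [p.strip() for p in parts if p.strip()]


def classify(text: str) -> str:
    """Return the first stage type whose keyword appears as a whole word or phrase."""
    lowered = text.lower()
    for stage_type, keywords in STAGE_KEYWORDS.items():
        if any(re.search(rf"\b{re.escape(kw)}\b", lowered) for kw in keywords):
            return stage_type
    return "unknown"


def infer_properties(stage_type: str, text: str) -> dict:
    """Extract one property from the sub-utterance that produced the stage."""
    if stage_type not in PROPERTY_PATTERNS:
        return {}
    name, pattern = PROPERTY_PATTERNS[stage_type]
    match = re.search(pattern, text, flags=re.IGNORECASE)
    if not match:
        return {}
    # Use the capture group when the pattern has one, else the whole match.
    return {name: match.group(match.lastindex or 0)}


def predict_edges(stages: list[Stage]) -> list[tuple[int, int]]:
    """Heuristic edges: a join consumes every open output, other stages consume one."""
    edges = []
    open_ids = []  # stages whose output has not been consumed yet
    for stage in stages:
        if stage.stage_type == "join":
            edges.extend((src, stage.stage_id) for src in open_ids)
            open_ids = []
        elif stage.stage_type != "source" and open_ids:
            edges.append((open_ids.pop(), stage.stage_id))
        open_ids.append(stage.stage_id)
    return edges


def build_workflow(utterance: str) -> tuple[list[Stage], list[tuple[int, int]]]:
    """Run decomposition, classification, property inference, then edge prediction."""
    stages = []
    for i, sub in enumerate(decompose(utterance)):
        stage_type = classify(sub)
        stages.append(Stage(i, stage_type, sub, infer_properties(stage_type, sub)))
    return stages, predict_edges(stages)


if __name__ == "__main__":
    stages, edges = build_workflow(
        "Read customers.csv; read orders.csv; join on customer_id, "
        "then filter where total > 100, then write to report.csv"
    )
    for stage in stages:
        print(stage.stage_id, stage.stage_type, stage.properties)
    print(edges)
```

Keeping each step as a separate function mirrors the modularity the abstract claims: in this sketch, any one stand-in could be swapped for a learned component without changing the others. The join rule is what lets the example produce a non-linear flow, with two sources feeding one join.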