High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services

High-fidelity test data is in high demand in industrial settings, where access to production data is largely restricted. Traditional data generation methods often fall short: they produce low-fidelity data and struggle to model the complex data structures and semantic relationships that are critical for testing complex SQL code generation services such as Natural Language to SQL (NL2SQL). In this paper, we address the need to generate syntactically correct and semantically relevant high-fidelity mock data for complex data structures, including the columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures and their lack of semantically coherent test data, which leads to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate test data that adheres to complex structural constraints and maintains semantic integrity with respect to the SQL test targets (queries/functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, enabling robust evaluation of SQL code generation services such as NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of LLM-based (Gemini) test data generation for industrial SQL code generation services, where generating high-fidelity test data is essential because production datasets are frequently unavailable or inaccessible for testing.
