High-Fidelity And Complex Test Data Generation For Google SQL Code Generation Services

High-fidelity test data is essential in industrial settings where access to production data is largely restricted. Traditional data generation methods often fall short: they produce low-fidelity data and struggle to model the complex data structures and semantic relationships that are critical for testing complex SQL code generation services such as Natural Language to SQL (NL2SQL). In this paper, we address the need to generate syntactically correct and semantically relevant high-fidelity mock data for complex data structures, including the columns with nested structures that we frequently encounter in Google workloads. We highlight the limitations of existing approaches used in production, particularly their inability to handle large and complex data structures and their lack of semantically coherent test data, which leads to limited test coverage. We demonstrate that by leveraging Large Language Models (LLMs) and incorporating strategic pre- and post-processing steps, we can generate high-fidelity test data that adheres to complex structural constraints and remains semantically consistent with the SQL test targets (queries and functions). This approach supports comprehensive testing of complex SQL queries involving joins, aggregations, and even deeply nested subqueries, enabling robust evaluation of SQL code generation services such as NL2SQL and SQL Code Assistant. Our results demonstrate the practical utility of LLM-based (Gemini) test data generation for industrial SQL code generation services, where high-fidelity test data is essential because production datasets are frequently unavailable or inaccessible for testing.
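The abstract does not spell out the pre- and post-processing steps, so the following is only an illustrative sketch of what one post-processing check might look like: validating an LLM-generated row against a schema with nested columns before it is accepted as test data. The schema notation, the `validate` helper, and the example `ORDER_SCHEMA` are all hypothetical and are not taken from the paper.

```python
from typing import Any

# Hypothetical schema notation (an assumption, not the paper's format):
# a type is a scalar name, {"array": T}, or {"struct": {field_name: T}}.
SCALARS = {"INT64": int, "STRING": str, "FLOAT64": float, "BOOL": bool}


def validate(value: Any, typ: Any, path: str = "row") -> list[str]:
    """Return a list of violations; an empty list means the value conforms."""
    if isinstance(typ, str):
        expected = SCALARS[typ]
        # bool is a subclass of int in Python, so reject it for non-BOOL types.
        ok = isinstance(value, expected) and not (
            typ != "BOOL" and isinstance(value, bool)
        )
        return [] if ok else [f"{path}: expected {typ}, got {type(value).__name__}"]

    if "array" in typ:
        if not isinstance(value, list):
            return [f"{path}: expected ARRAY, got {type(value).__name__}"]
        errors: list[str] = []
        for i, item in enumerate(value):
            errors += validate(item, typ["array"], f"{path}[{i}]")
        return errors

    # Otherwise the type is a struct: check every declared field recursively.
    fields = typ["struct"]
    if not isinstance(value, dict):
        return [f"{path}: expected STRUCT, got {type(value).__name__}"]
    errors = []
    for name, field_type in fields.items():
        if name not in value:
            errors.append(f"{path}.{name}: missing field")
        else:
            errors += validate(value[name], field_type, f"{path}.{name}")
    for name in sorted(value.keys() - fields.keys()):
        errors.append(f"{path}.{name}: unexpected field")
    return errors


# Hypothetical table with a nested struct column and a repeated struct column.
ORDER_SCHEMA = {
    "struct": {
        "order_id": "INT64",
        "customer": {"struct": {"name": "STRING"}},
        "items": {"array": {"struct": {"sku": "STRING", "qty": "INT64"}}},
    }
}

good_row = {"order_id": 1, "customer": {"name": "Ada"}, "items": [{"sku": "A-1", "qty": 2}]}
bad_row = {"order_id": "1", "customer": {"name": "Ada"}, "items": [{"sku": "A-1"}]}

print(validate(good_row, ORDER_SCHEMA))
print(validate(bad_row, ORDER_SCHEMA))
```

A check of this kind covers only structural conformance. Semantic consistency with the SQL test target, such as join keys that actually match across tables, would need separate handling.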
