The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, collecting such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by reliance on pre-existing data, poor scalability, a single-domain focus, and a lack of multilingual support. We present STELLAR-E, a fully automated system that generates high-quality synthetic datasets of custom size from minimal human input, without depending on existing datasets. The system is structured in two stages: (1) a synthetic data engine, built by modifying the TGRT Self-Instruct framework, that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline incorporating statistical and LLM-based metrics to assess the applicability of the synthetic dataset for evaluating LLM-based applications. The synthetic datasets reach an average difference of +5.7% in LLM-as-a-judge scores against existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of both large and small LLMs. While real datasets remain slightly more challenging for LLMs, especially for smaller models, this work establishes a scalable and domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling high-efficiency automated quality assurance cycles.