STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator

The increasing reliance on Large Language Models (LLMs) across diverse sectors highlights the need for robust domain-specific and language-specific evaluation datasets; however, collecting such datasets is challenging due to privacy concerns, regulatory restrictions, and the time cost of manual creation. Existing automated benchmarking methods are often limited by reliance on pre-existing data, poor scalability, a single-domain focus, and a lack of multilingual support. We present STELLAR-E, a fully automated system that generates high-quality synthetic datasets of custom size from minimal human input, without depending on existing datasets. The system has two stages: (1) a synthetic data engine, built by modifying the TGRT Self-Instruct framework, that enables controllable, custom synthetic dataset generation, and (2) an evaluation pipeline that combines statistical and LLM-based metrics to assess how suitable the synthetic dataset is for evaluating LLM-based applications. The synthetic datasets show an average difference of +5.7% in LLM-as-a-judge scores relative to existing language-specific benchmarks, demonstrating comparable quality for comprehensive assessment of both large and small LLMs. While real datasets remain slightly more challenging for LLMs, especially smaller models, this work establishes a scalable, domain-adaptable benchmarking framework that supports fair evaluation of LLM applications, offering a faster alternative to manual approaches and enabling highly efficient automated quality assurance cycles.
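The abstract does not say how the average difference in LLM-as-a-judge scores is computed. One plausible reading, sketched below, is the mean per-model gap between the score on the synthetic dataset and the score on the real benchmark, in percentage points. The function name `average_score_gap`, the model labels, and all score values are hypothetical placeholders for illustration; none of them come from the paper.

```python
from statistics import mean


def average_score_gap(synthetic_scores, real_scores):
    """Mean per-model gap (synthetic minus real), in percentage points.

    Both arguments map a model name to its judge score in percent.
    Only models present in both mappings are compared.
    """
    shared = sorted(set(synthetic_scores) & set(real_scores))
    if not shared:
        raise ValueError("no models in common between the two score sets")
    return mean(synthetic_scores[m] - real_scores[m] for m in shared)


# Placeholder judge scores (percent); NOT results from the paper.
synthetic = {"large-model": 90.0, "small-model": 70.0}
real = {"large-model": 86.0, "small-model": 62.0}

gap = average_score_gap(synthetic, real)
print(f"{gap:+.1f} pp")  # prints "+6.0 pp"
```

Under this assumed definition, a positive gap means models score higher on the synthetic data than on the real benchmark, which matches the abstract's remark that real datasets remain slightly more challenging.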

