Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG), yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured, parsimonious summary of entities and relations that can be reused to generate questions at instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. The knowledge graph acts as a compressed, reusable state, turning question generation into a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (a single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific, difficulty-controlled evaluation.