Large language models (LLMs) have become instrumental in applications such as Retrieval-Augmented Generation (RAG), yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured, parsimonious summary of entities and relations that can be reused to generate questions at instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. The knowledge graph acts as a compressed, reusable state, turning question generation into a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (a single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific, difficulty-controlled evaluation.