Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including Edge Attribution Patching and Sparse Autoencoders, to identify the minimal circuits and components supporting SQL generation. We compare circuits across different SQL subskills, evaluating their minimality, reliability, and identifiability. Finally, we conduct a layerwise logit-lens analysis to reveal how models compose SQL queries across layers: from intent recognition to schema resolution to structured generation. Our work provides a robust framework for probing and comparing interpretability methods in a structured, progressively complex setting.