Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark, constructed through a controlled translation pipeline that adapts schema identifiers to Turkish while strictly preserving the logical structure and execution semantics of SQL queries and databases. Translation quality is validated by human evaluation of a sample whose size is determined via the Central Limit Theorem for 95% confidence; the evaluated samples reach 98.15% accuracy. Using BIRDTurk, we evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Our results reveal that Turkish introduces consistent performance degradation, driven by both structural linguistic divergence and underrepresentation in LLM pretraining, while agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models. BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. We release the training and development splits to support future research.
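As a rough illustration of how a validation sample size can be derived for a 95% confidence level, the sketch below applies the standard normal-approximation (Cochran) formula for a proportion, with a finite population correction. The abstract does not state the margin of error, the assumed proportion, or the number of translated items, so `margin_of_error=0.05`, `p=0.5`, and the population of 10,000 are assumptions chosen purely for illustration, and the function name `required_sample_size` is hypothetical. This may differ from the procedure actually used for BIRDTurk.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence=0.95, margin_of_error=0.05, p=0.5):
    """Sample size for estimating a proportion (normal approximation).

    All default values are illustrative assumptions, not figures from the paper.
    """
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Cochran's formula for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    # Finite population correction reduces n0 when the population is small.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical pool of 10,000 translated items.
    print(required_sample_size(population=10_000))
```

Using `p=0.5` is the conservative choice, since it maximizes `p * (1 - p)` and therefore the required sample size; a tighter margin of error increases the sample size quadratically.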