Can large language models converse in languages virtually absent from their training data? We investigate this question through a case study on Tulu, a Dravidian language with over 2 million speakers but minimal digital presence. Rather than fine-tuning an LLM, we examine whether structured prompts alone can elicit basic conversational ability. We systematically address the challenges posed by the absence of Tulu training data by combining explicit grammar documentation, negative constraints that suppress high-probability tokens from related languages, romanization standardization, and quality-controlled synthetic data generation via self-play. Evaluated on a manually curated held-out set across three LLMs (Gemini 2.0 Flash, GPT-4o, Llama 3.1 70B) and validated by native speakers, our approach reduces vocabulary contamination from 80% to 5% while achieving 85% grammatical accuracy. Cross-model analysis reveals that negative constraints provide consistent improvements (12--18 percentage points), whereas the effect of grammar documentation varies by model architecture (8--22 points).
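To make the negative-constraint idea concrete, the sketch below shows one way such suppression could work at decoding time: logits for token IDs associated with a related high-resource language are pushed to negative infinity before softmax, so they receive zero probability. This is a minimal illustration, not the paper's actual implementation; the vocabulary, token IDs, and blocklist are hypothetical.

```python
import math

def apply_negative_constraints(logits, blocked_ids):
    """Suppress blocked token IDs by setting their logits to -inf,
    so they receive zero probability after softmax."""
    constrained = list(logits)
    for tid in blocked_ids:
        constrained[tid] = float("-inf")
    return constrained

def softmax(logits):
    """Numerically stable softmax; -inf logits map to probability 0."""
    m = max(x for x in logits if x != float("-inf"))
    exps = [math.exp(x - m) if x != float("-inf") else 0.0 for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical 5-token vocabulary; tokens 1 and 3 stand in for
# high-probability forms from a related language (e.g., Kannada)
# that contaminate Tulu output and should be blocked.
logits = [2.0, 3.5, 1.0, 4.0, 0.5]
probs = softmax(apply_negative_constraints(logits, blocked_ids=[1, 3]))
```

In practice the same effect can be achieved through API-level logit-bias parameters or blocked-word lists, where available, rather than manual logit manipulation.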