Long-context question answering (QA) over literary texts poses significant challenges for modern large language models, particularly in low-resource languages. We address the scarcity of long-context QA resources for Indic languages by introducing LittiChoQA, the largest literary QA dataset to date covering many languages spoken in the Gangetic plains of India. The dataset comprises over 270K automatically generated question-answer pairs with a balanced distribution of factoid and non-factoid questions, generated from naturally authored literary texts collected from the open web. We evaluate multiple multilingual LLMs on non-factoid, abstractive QA under both full-context and context-shortened settings. Results demonstrate a clear trade-off between performance and efficiency: full-context fine-tuning yields the highest token-level and semantic-level scores, while context shortening substantially improves throughput. Among the evaluated models, Krutrim-2 achieves the strongest performance, obtaining a semantic score of 76.1 with full context; in shortened-context settings, it scores 74.9 with answer-paragraph selection and 71.4 with vector-based retrieval. Qualitative evaluations further corroborate these findings.
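The vector-based retrieval setting described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the `embed` function below is a toy bag-of-words stand-in for a real multilingual sentence embedder, and all function names and parameters are illustrative assumptions.

```python
import math
from collections import Counter

def embed(text):
    # Toy term-frequency vector; a real system would use a
    # multilingual sentence embedder instead (assumption).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def shorten_context(paragraphs, question, top_k=2):
    # Rank paragraphs by similarity to the question, keep the top_k,
    # and preserve their original document order in the output.
    q_vec = embed(question)
    ranked = sorted(range(len(paragraphs)),
                    key=lambda i: cosine(embed(paragraphs[i]), q_vec),
                    reverse=True)
    keep = sorted(ranked[:top_k])
    return [paragraphs[i] for i in keep]

paragraphs = [
    "the hero travels to the city",
    "the weather was cold",
    "the hero meets the king in the city",
]
question = "who does the hero meet in the city"
shortened = shorten_context(paragraphs, question, top_k=2)
```

Only the retained paragraphs are passed to the LLM, which is where the throughput gain over the full-context setting comes from: the model reads a few relevant paragraphs instead of the entire literary text.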