Information retrieval (IR) in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators raises concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed with a BETA-labeling framework that employs multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we evaluate meaning preservation and task validity between source and translated datasets. Our experiments reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and the limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
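The majority-agreement step described above can be sketched as a simple vote over per-item labels from the LLM annotators. This is a minimal, hypothetical illustration (the function name, label strings, and agreement threshold are assumptions, not the paper's actual implementation); items with no majority would be escalated to human review.

```python
from collections import Counter

def majority_label(labels, min_agreement=2):
    """Return the label most LLM annotators agree on, or None if no
    label reaches the agreement threshold (escalate to human review)."""
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None

# Example: three LLM annotators judge a query-document pair.
print(majority_label(["relevant", "relevant", "not_relevant"]))   # relevant
print(majority_label(["relevant", "not_relevant", "partial"]))    # None -> human review
```

With three annotators and a threshold of two, ties and three-way splits are routed to human evaluation rather than silently resolved, mirroring the human-verification stage described in the abstract.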