AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning

Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that support research on (i) neural retrieval and ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction, and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation; positive and negative documents are selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV, JSON, JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. The datasets are accompanied by a construction methodology that can be generalized to other low-resource languages.
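
As a rough illustration of how the query-positive-negative triplets might be consumed from a JSONL release, the sketch below parses one triplet per line. The abstract does not specify the schema, so the field names `query`, `positive`, and `negative` are assumptions, and the sample records are placeholders rather than dataset content.

```python
import io
import json
from typing import Iterator, NamedTuple, TextIO


class Triplet(NamedTuple):
    """One retrieval training example (field names are assumed, not confirmed)."""
    query: str
    positive: str
    negative: str


def read_triplets(stream: TextIO) -> Iterator[Triplet]:
    """Yield one Triplet per non-empty JSONL line."""
    for line_number, line in enumerate(stream, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        try:
            yield Triplet(record["query"], record["positive"], record["negative"])
        except KeyError as missing:
            raise ValueError(f"line {line_number}: missing field {missing}") from None


# Placeholder records for demonstration only; not taken from the dataset.
SAMPLE = (
    '{"query": "q1", "positive": "p1", "negative": "n1"}\n'
    "\n"
    '{"query": "q2", "positive": "p2", "negative": "n2"}\n'
)

triplets = list(read_triplets(io.StringIO(SAMPLE)))
print(len(triplets), triplets[0].query)
```

With a real file, `read_triplets(open(path, encoding="utf-8"))` would stream records the same way; opening with explicit UTF-8 matters because Amharic text is written in the Ethiopic script.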

