Retrieval-augmented generation (RAG) systems are widely used in question-answering (QA) tasks, but current benchmarks lack metadata integration, limiting evaluation in scenarios that require both textual data and external information. To address this gap, we present AMAQA, a new open-access QA dataset designed to evaluate tasks that combine text and metadata. Metadata integration is especially important in fields that demand rapid analysis of large volumes of data, such as cybersecurity and intelligence, where timely access to relevant information is critical. AMAQA includes about 1.1 million English messages collected from 26 public Telegram groups, enriched with metadata such as timestamps and chat names, as well as 20,000 hotel reviews with metadata. Both the Telegram messages and the hotel reviews are further annotated with emotional tones or toxicity indicators. Across these two domains, the dataset provides 2,600 high-quality QA pairs, making AMAQA a valuable resource for advancing research on metadata-driven QA and RAG systems. To the best of our knowledge, AMAQA is the first single-hop QA benchmark to incorporate metadata. We conduct extensive experiments on the benchmark, setting a reference point for future research: leveraging metadata boosts accuracy from 0.50 to 0.86 for GPT-4o and from 0.27 to 0.76 for open-source LLMs, highlighting the value of structured context. We also evaluate known techniques for enhancing RAG on our benchmark, underscoring the importance of properly managing metadata throughout the entire RAG pipeline.
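The metadata-driven retrieval idea described above can be illustrated with a minimal sketch: documents carry structured attributes (e.g., chat name and timestamp, as in AMAQA's Telegram messages), and the retriever first filters candidates on those attributes before ranking by textual relevance. This is not the paper's actual pipeline; the `Message` class, the toy lexical-overlap scorer (standing in for real embeddings), and all example data are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Message:
    text: str
    chat: str
    timestamp: datetime

def score(query: str, text: str) -> float:
    # Toy lexical-overlap relevance score; a real system would use embeddings.
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def retrieve(corpus, query, chat=None, after=None, top_k=2):
    # Step 1: pre-filter on structured metadata (chat name, timestamp).
    candidates = [
        m for m in corpus
        if (chat is None or m.chat == chat)
        and (after is None or m.timestamp >= after)
    ]
    # Step 2: rank the surviving candidates by textual relevance.
    return sorted(candidates, key=lambda m: score(query, m.text), reverse=True)[:top_k]

corpus = [
    Message("server outage reported in the eu region", "ops-chat", datetime(2024, 5, 1)),
    Message("outage resolved after failover", "ops-chat", datetime(2024, 5, 2)),
    Message("lunch plans for friday", "random", datetime(2024, 5, 2)),
]

hits = retrieve(corpus, "outage in eu region",
                chat="ops-chat", after=datetime(2024, 5, 1))
```

Filtering before ranking is what lets a question like "what was reported in ops-chat after May 1?" exclude topically similar but out-of-scope messages, which pure semantic search over text alone cannot do.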