Comparative Analysis of 47 Context-Based Question Answer Models Across 8 Diverse Datasets

Context-based question answering (CBQA) models provide more accurate and relevant answers by taking contextual information into account. They extract specific information from a given context, which makes them useful in applications such as user support, information retrieval, and educational platforms. In this manuscript, we benchmarked the performance of 47 CBQA models from Hugging Face on eight different datasets. The study aims to identify the model that performs best across diverse datasets without additional fine-tuning. This is valuable for practical applications because it minimizes the need to retrain models for specific datasets and streamlines their deployment in various contexts. The best-performing models were trained on the SQuAD v1 or SQuAD v2 datasets. The best model overall was ahotrod/electra_large_discriminator_squad2_512, which yielded 43% accuracy across all datasets. We observed that the computation time of every model depends on the context length and the model size. Model performance usually decreases as the answer length increases, and it also depends on the complexity of the context. We also used a genetic algorithm to improve the overall accuracy by integrating responses from other models. ahotrod/electra_large_discriminator_squad2_512 generated the best results for bioasq10b-factoid (65.92%), biomedical_cpgQA (96.45%), QuAC (11.13%), and Question Answer Dataset (41.6%). Bert-large-uncased-whole-word-masking-finetuned-squad achieved an accuracy of 82% on the IELTS dataset.
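The abstract reports per-dataset and overall accuracy but does not say how a prediction is judged correct or how the datasets are combined. The following is a minimal sketch of one plausible scoring scheme, not the authors' method. It assumes SQuAD-style normalized exact match and an unweighted mean across datasets. The function names and the sample records are hypothetical and exist only for illustration.

```python
import re
import string
from collections import defaultdict


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_correct(prediction, gold_answers):
    """Assumed criterion: the prediction matches any gold answer after normalization."""
    pred = normalize(prediction)
    return any(pred == normalize(gold) for gold in gold_answers)


def accuracy_by_dataset(records):
    """records: iterable of (dataset_name, prediction, gold_answers) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dataset, prediction, gold_answers in records:
        totals[dataset] += 1
        hits[dataset] += is_correct(prediction, gold_answers)
    return {name: hits[name] / totals[name] for name in totals}


def overall_accuracy(per_dataset):
    """Assumed aggregation: unweighted mean of the per-dataset accuracies."""
    return sum(per_dataset.values()) / len(per_dataset)


# Hypothetical predictions from one model; not data from the paper.
records = [
    ("IELTS", "The Nile", ["Nile"]),
    ("IELTS", "Paris.", ["London"]),
    ("QuAC", "in 1999", ["1999"]),
    ("QuAC", "a guitar", ["guitar", "the guitar"]),
]

per_dataset = accuracy_by_dataset(records)
print(per_dataset)
print(overall_accuracy(per_dataset))
```

Exact match is strict: "in 1999" does not match the gold answer "1999", which is one way longer answers can lower a score. A token-overlap metric such as F1 would be more lenient, and the abstract does not indicate which was used.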

