ScholarChemQA: Unveiling the Power of Language Models in Chemical Research Question Answering

Question Answering (QA) is an effective way to evaluate the reasoning ability and knowledge depth of language models. While QA datasets are plentiful for the general domain and biomedicine, academic chemistry is less explored. Chemical QA plays a crucial role in both education and research by translating complex chemical information into a readily understandable format. To address this gap, we introduce ScholarChemQA, a large-scale QA dataset constructed from chemical papers. The dataset reflects typical real-world challenges, including an imbalanced label distribution and a substantial amount of potentially useful unlabeled data. Correspondingly, we introduce QAMatch, a model designed to answer chemical questions by fully leveraging the collected data. We first address the imbalanced label distribution by re-weighting the instance-wise loss by the inverse frequency of each class, so that minority classes are not dominated by majority ones during optimization. Next, we use the unlabeled data to enrich the learning process: we generate a variety of augmentations with a SoftMix operation and require their predictions to align with the same target, i.e., the pseudo-label. To ensure the quality of the pseudo-labels, we propose a calibration procedure that closely aligns the pseudo-label estimates of individual samples with a desired ground-truth distribution. Experiments show that QAMatch significantly outperforms recent baselines of similar scale as well as Large Language Models (LLMs), not only on our ScholarChemQA dataset but also on four benchmark datasets. We hope our benchmark and model can facilitate and promote more research on chemical QA.
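To make the two loss-related ideas above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the label set ("yes", "no", "maybe"), and the weight normalization N / (K * n_c) are assumptions chosen for illustration. The calibration function uses a generic distribution-alignment rescaling as a stand-in, since the abstract does not specify the exact procedure. SoftMix is not sketched because its definition is not given here.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, proportional to 1 / class frequency.

    Assumed normalization: weight_c = N / (K * n_c), where N is the number
    of instances, K the number of classes, and n_c the count of class c.
    With this choice the average weight over all instances equals 1.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Mean instance-wise cross-entropy, each term scaled by its class weight.

    probs is a list of dicts mapping class -> predicted probability.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)


def calibrate_pseudo_label(probs, running_mean, target):
    """Rescale one sample's prediction toward a desired class distribution.

    Assumption: a distribution-alignment style update, in which each class
    probability is multiplied by target[c] / running_mean[c] and the result
    is renormalized. running_mean is the model's average prediction over
    unlabeled data; target is the desired class distribution.
    """
    scaled = {c: probs[c] * target[c] / running_mean[c] for c in probs}
    z = sum(scaled.values())
    return {c: v / z for c, v in scaled.items()}


# Hypothetical imbalanced label set: 6 "yes", 3 "no", 1 "maybe".
labels = ["yes"] * 6 + ["no"] * 3 + ["maybe"]
weights = inverse_frequency_weights(labels)

# The rarest class receives the largest weight.
assert weights["maybe"] > weights["no"] > weights["yes"]
```

With these weights, an error on a rare class contributes more to the loss than the same error on a frequent class, which is the effect the re-weighting step aims for. The calibration function shifts probability mass toward classes that the model predicts less often than the target distribution calls for.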
