Large language models have shown strong performance on broad-domain knowledge and reasoning benchmarks, but it remains unclear how well they handle specialized animal-related knowledge under a unified closed-book evaluation protocol. We introduce BAGEL, a benchmark for evaluating animal knowledge expertise in language models. BAGEL is constructed from diverse scientific and reference sources, including bioRxiv, Global Biotic Interactions, Xeno-canto, and Wikipedia, and combines curated examples with automatically generated closed-book question-answer pairs. The benchmark covers multiple aspects of animal knowledge, including taxonomy, morphology, habitat, behavior, vocalization, geographic distribution, and species interactions. Because evaluation is closed-book, BAGEL measures the animal-related knowledge a model can produce without external retrieval at inference time. BAGEL further supports fine-grained analysis across source domains, taxonomic groups, and knowledge categories, enabling a more precise characterization of model strengths and systematic failure modes. BAGEL provides a new testbed for studying domain-specific knowledge generalization in language models and for improving their reliability in biodiversity-related applications.