We introduce a benchmark dataset for question answering and translation in bilingual Latin and English settings, containing about 7,800 question-answer pairs. The questions are drawn from Latin pedagogical sources, including exams, quizbowl-style trivia, and textbooks, dating from the 1800s to the present. After automated extraction, cleaning, and manual review, the dataset covers a diverse range of question types: knowledge- and skill-based, multihop reasoning, constrained translation, and mixed language pairs. To our knowledge, this is the first QA benchmark centered on Latin. As a case study, we evaluate three large language models (LLaMa 3, Qwen QwQ, and OpenAI's o3-mini) and find that all three perform worse on skill-oriented questions than on knowledge-oriented ones. Although the reasoning models perform better on scansion and literary-device tasks, they offer limited improvement overall. QwQ performs slightly better on questions asked in Latin, whereas the results for LLaMa 3 and o3-mini are more task-dependent. This dataset provides a new resource for assessing model capabilities in a specialized linguistic and cultural domain, and the creation process can be readily adapted to other languages. The dataset is available at: https://github.com/slanglab/RespondeoQA