Although relation extraction (RE) is widely applied across domains, it remains little explored in the historical context, which offers promising data spanning hundreds to thousands of years. To promote historical RE research, we present HistRED, a dataset constructed from Yeonhaengnok. Yeonhaengnok is a collection of records originally written in Hanja, the classical Chinese writing, and later translated into Korean. HistRED provides bilingual annotations so that RE can be performed on both Korean and Hanja texts. In addition, HistRED supports self-contained subtexts of varying lengths, from the sentence level to the document level, giving researchers diverse context settings in which to evaluate the robustness of their RE models. To demonstrate the usefulness of our dataset, we propose a bilingual RE model that leverages both Korean and Hanja contexts to predict relations between entities. Our model outperforms monolingual baselines on HistRED, showing that employing multiple language contexts supplements the RE predictions. The dataset is publicly available at https://huggingface.co/datasets/Soyoung/HistRED under the CC BY-NC-ND 4.0 license.