AthDGC ("Athens-PROIEL") is an open, end-to-end workflow and dataset. To the best of our knowledge, it is the first openly licensed dependency-parsed treebank of Greek to span eight diachronic periods (Archaic, Classical, Koine, Late Antique, Byzantine, Late Byzantine, Early Modern, and Modern Greek) under a single PROIEL XML 2.0 schema, with verse-level cross-alignment of the New Testament to Latin (Vulgate), Gothic (Wulfila), Old Church Slavonic (Marianus), and Classical Armenian. AthDGC builds on the PROIEL Treebank Family (Haug and Johndal 2008; Eckhoff et al. 2018), which established the schema and the Koine Greek reference set for the project. Annotation uses the Stanford Stanza pipeline with PROIEL-trained models; sentence-level alignment uses LaBSE, a multilingual sentence-embedding model; word-level alignment uses multilingual BERT attention through the AwesomeAlign procedure. The v0.4 release provides curated samples and the open-source toolkit; the full annotated corpus partitions remain under v0.5 audit on the Greek national HPC infrastructure. Quantitative scale, per-witness verse counts, and per-period annotated-row counts will be reported in the v0.5 release notes once the audit pass completes. Concept DOI: 10.5281/zenodo.20439182.
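As a rough illustration of what verse-level cross-alignment involves, the sketch below joins several witnesses on a shared (book, chapter, verse) key, using the Greek text as the pivot. It is not the AthDGC toolkit's code: the function names `align_by_verse` and `coverage`, the key format, and the placeholder strings are all assumptions made for this example, and the gap in the second witness is invented to show how missing verses might be handled.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical verse key: (book, chapter, verse), e.g. ("JOHN", 1, 1).
VerseKey = Tuple[str, int, int]
Witness = Dict[VerseKey, str]


def align_by_verse(
    pivot: Witness, witnesses: Dict[str, Witness]
) -> List[Dict[str, Optional[object]]]:
    """Return one row per pivot verse; a witness lacking the verse gets None."""
    rows: List[Dict[str, Optional[object]]] = []
    for key, text in pivot.items():  # keeps the pivot's own verse order
        row: Dict[str, Optional[object]] = {"verse": key, "greek": text}
        for name, verses in witnesses.items():
            row[name] = verses.get(key)  # None marks a gap in that witness
        rows.append(row)
    return rows


def coverage(rows: List[Dict[str, Optional[object]]], name: str) -> float:
    """Fraction of pivot verses for which the named witness has text."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(name) is not None) / len(rows)


# Placeholder strings stand in for verse text; they are not corpus data.
greek: Witness = {
    ("JOHN", 1, 1): "grc-placeholder-1",
    ("JOHN", 1, 2): "grc-placeholder-2",
}
others: Dict[str, Witness] = {
    "latin": {
        ("JOHN", 1, 1): "lat-placeholder-1",
        ("JOHN", 1, 2): "lat-placeholder-2",
    },
    # Invented gap, only to show how a fragmentary witness is represented.
    "gothic": {("JOHN", 1, 2): "got-placeholder-2"},
}

rows = align_by_verse(greek, others)
```

Keeping an explicit `None` for a missing verse, rather than dropping the row, makes per-witness verse counts easy to derive from the same table, for example with `coverage(rows, "gothic")`.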