Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans and supporting evaluation at both document and word levels. Evaluation is performed relative to human agreement rather than a single reference annotation. We evaluate a diverse range of large language models alongside BERT-based models fine-tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near-human topic detection to pronounced failures in span localization. While the strongest models approach human agreement, the distilled token embedding models remain competitive despite their smaller scale. The dataset and evaluation framework are publicly available at: https://github.com/dcgm/czechtopic.