Digital libraries that maintain extensive textual collections may want to further enrich their content for certain downstream applications, e.g., building knowledge graphs, semantically enriching documents, or implementing novel access paths. All of these applications require some text processing, whether to identify relevant entities, extract semantic relationships between them, or classify documents into categories. However, implementing reliable, supervised workflows can become quite challenging for a digital library because suitable training data must be crafted and reliable models must be trained. While many works focus on achieving the highest accuracy on some benchmark, we tackle the problem from the perspective of a digital library practitioner. In other words, we also consider trade-offs between accuracy and application costs, dive into training data generation through distant supervision and large language models such as ChatGPT, Llama, and OLMo, and discuss how to design final pipelines. To this end, we focus on relation extraction and text classification, using eight biomedical benchmarks as a showcase.