Extraction of Research Objectives, Machine Learning Model Names, and Dataset Names from Academic Papers and Analysis of Their Interrelationships Using LLM and Network Analysis


August 22, 2024


S. Nishio, H. Nonaka, N. Tsuchiya, A. Migita, Y. Banno, T. Hayashi, H. Sakaji, T. Sakumoto, K. Watabe

From arXiv, 10 pages, 8 figures

Machine learning is widely used across industries. Identifying the appropriate machine learning models and datasets for a specific task is crucial for applying machine learning effectively in industry. However, this requires expertise in both machine learning and the relevant domain, which leads to a high learning cost. Research on extracting combinations of tasks, machine learning models, and datasets from academic papers is therefore important, as it can facilitate the automatic recommendation of suitable methods. Conventional methods for extracting information from academic papers have been limited to identifying machine learning models and other entities as named entities. To address this issue, this study proposes a methodology for extracting tasks, machine learning methods, and dataset names from scientific papers and for analyzing the relationships among these pieces of information using an LLM, an embedding model, and network clustering. With Llama3, the proposed method extracts expressions with an F-score exceeding 0.8 across various categories, confirming its practical utility. Benchmarking results on financial domain papers demonstrate the effectiveness of the method and provide insights into the use of the latest datasets, including those related to ESG (Environmental, Social, and Governance) data.
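To make the two quantitative ideas in the abstract concrete, here is a minimal sketch (not the authors' implementation) of how an F-score for extracted expressions could be computed and how expressions that co-occur in a paper could be grouped. Everything in it is an assumption for illustration: the function names, the sample expressions, the exact-match scoring, and the use of connected components as a simple stand-in for the network clustering the paper mentions. The LLM extraction step and the embedding model are not reproduced.

```python
from collections import defaultdict
from itertools import combinations


def f_score(predicted, gold):
    """F1 of extracted expressions against gold labels (exact-match assumption)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def build_cooccurrence_graph(papers):
    """Link every pair of expressions that were extracted from the same paper."""
    graph = defaultdict(set)
    for expressions in papers:
        unique = sorted(set(expressions))
        for name in unique:
            graph[name]  # register the node even if it has no neighbours
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Group nodes reachable from each other (stand-in for network clustering)."""
    seen = set()
    components = []
    for start in sorted(graph):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(graph[node] - seen)
        components.append(component)
    return components


# Hypothetical extraction results: one list of expressions per paper.
papers = [
    ["stock price prediction", "LSTM", "S&P 500"],
    ["stock price prediction", "Transformer"],
    ["ESG scoring", "BERT", "ESG reports"],
]
clusters = connected_components(build_cooccurrence_graph(papers))
score = f_score(["LSTM", "BERT", "GRU"], ["LSTM", "BERT"])
```

In this toy input, the two papers sharing "stock price prediction" fall into one group and the ESG paper forms another. A real pipeline would likely merge near-duplicate expressions by embedding similarity before building the graph and would use a proper community detection algorithm instead of connected components.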

