Machine learning is widely used across industries. Identifying the appropriate machine learning models and datasets for a specific task is crucial for applying machine learning effectively in industry. However, this requires expertise in both machine learning and the relevant domain, which makes the learning cost high. Research on extracting combinations of tasks, machine learning models, and datasets from academic papers is therefore important, because it can support the automatic recommendation of suitable methods. Conventional methods for extracting information from academic papers have been limited to identifying machine learning models and other entities as named entities. To address this limitation, this study proposes a method that extracts tasks, machine learning methods, and dataset names from scientific papers and analyzes the relationships among them using an LLM, an embedding model, and network clustering. With Llama3, the proposed method's expression extraction achieves an F-score exceeding 0.8 across various categories, confirming its practical utility. Benchmarking on financial-domain papers demonstrates the effectiveness of the method and provides insights into the use of the latest datasets, including those related to ESG (Environmental, Social, and Governance) data.