Although machine learning approaches are useful for the early screening of potential breakthrough technologies, their practical use is often hindered by model opacity. To address this, we propose an interpretable machine learning approach that predicts future citation counts from patent texts using a patent-specific hierarchical attention network (PatentHAN) model. Central to this approach are (1) a patent-specific pre-trained language model, which captures the meanings of technical words in patent claims, (2) a hierarchical network structure, which enables detailed analysis at the claim level, and (3) a claim-wise self-attention mechanism, which reveals pivotal claims during the screening process. A case study of 35,376 pharmaceutical patents demonstrates the effectiveness of our approach in the early screening of potential breakthrough technologies while ensuring interpretability. Furthermore, we conduct additional analyses using different language models and claim types to examine the robustness of the approach. We expect the proposed approach to enhance expert-machine collaboration in identifying breakthrough technologies and to provide new insight, derived from text mining, into technological value.
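To make the role of claim-level attention concrete, the following is a minimal sketch of attention pooling over claim vectors, in the general style of hierarchical attention networks. It is not the PatentHAN implementation: the function names (`softmax`, `attend_over_claims`), the plain dot-product scoring against a single context vector, and the toy two-dimensional claim vectors are all assumptions made for illustration. In the approach described above, the claim vectors would come from a patent-specific pre-trained language model and the scoring parameters would be learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_claims(claim_vectors, context):
    """Pool claim vectors into one patent vector using attention weights.

    Hypothetical simplification: each claim is scored by its dot product
    with a context vector; a trained model would learn this scoring.
    """
    scores = [sum(h * c for h, c in zip(vec, context)) for vec in claim_vectors]
    weights = softmax(scores)
    dim = len(context)
    patent_vector = [
        sum(w * vec[d] for w, vec in zip(weights, claim_vectors))
        for d in range(dim)
    ]
    return patent_vector, weights


# Toy input: three claims embedded in two dimensions (made-up values).
claims = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = [2.0, 0.0]  # made-up context vector

patent_vector, weights = attend_over_claims(claims, context)

# The claim with the largest weight is the one the pooling relied on most.
top_claim = max(range(len(weights)), key=weights.__getitem__)
print("attention weights:", [round(w, 3) for w in weights])
print("most influential claim index:", top_claim)
```

The attention weights sum to one, so they can be read as each claim's relative contribution to the patent-level representation. This is the kind of signal an expert could inspect to see which claims drove a prediction.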