The rapid expansion of research across machine learning, vision, and language has produced a volume of publications that is increasingly difficult to synthesize. Traditional bibliometric tools rely mainly on metadata and offer limited visibility into the semantic content of papers, making it hard to track how research themes evolve over time or how different areas influence one another. To obtain a clearer picture of recent developments, we compile a unified corpus of more than 100,000 papers from 22 major conferences between 2020 and 2025 and construct a multidimensional profiling pipeline to organize and analyze their textual content. By combining topic clustering, LLM-assisted parsing, and structured retrieval, we derive a comprehensive representation of research activity that supports the study of topic lifecycles, methodological transitions, dataset and model usage patterns, and institutional research directions. Our analysis highlights several notable shifts, including the growth of safety, multimodal reasoning, and agent-oriented studies, as well as the gradual stabilization of areas such as neural machine translation and graph-based methods. These findings provide an evidence-based view of how AI research is evolving and offer a resource for understanding broader trends and identifying emerging directions. Code and dataset: https://github.com/xzc-zju/Profiling_Scientific_Literature
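To make the idea of tracking topic lifecycles concrete, the following is a minimal sketch, not the authors' pipeline. It assumes each paper has already been assigned a single topic label (for example by a clustering step) and computes each topic's share of papers per year. The function names `topic_share_by_year` and `share_change` and the toy records are hypothetical, invented purely for illustration; they are not results from the corpus.

```python
from collections import Counter, defaultdict


def topic_share_by_year(papers):
    """Return {topic: {year: share}}.

    `papers` is a list of (year, topic) tuples. The share is the fraction
    of that year's papers carrying the topic label.
    """
    per_year_total = Counter(year for year, _ in papers)
    per_year_topic = Counter(papers)  # counts each (year, topic) pair
    shares = defaultdict(dict)
    for (year, topic), count in per_year_topic.items():
        shares[topic][year] = count / per_year_total[year]
    return {topic: dict(by_year) for topic, by_year in shares.items()}


def share_change(shares, topic, first_year, last_year):
    """Difference in a topic's share between two years (missing year = 0)."""
    by_year = shares.get(topic, {})
    return by_year.get(last_year, 0.0) - by_year.get(first_year, 0.0)


# Toy records (hypothetical, not drawn from the real corpus).
papers = [
    (2020, "neural machine translation"),
    (2020, "neural machine translation"),
    (2020, "graph-based methods"),
    (2020, "safety"),
    (2025, "safety"),
    (2025, "safety"),
    (2025, "agents"),
    (2025, "neural machine translation"),
]

shares = topic_share_by_year(papers)
for topic in sorted(shares):
    print(topic, share_change(shares, topic, 2020, 2025))
```

A positive change suggests a growing topic and a negative one a shrinking share. A real analysis would likely need multi-label topics, more years, and some smoothing, none of which this sketch attempts.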