As Large Language Models (LLMs) reshape the global labor market, policymakers and workers need empirical data on which occupational skills may be most susceptible to automation. We present the Skill Automation Feasibility Index (SAFI), which benchmarks four frontier LLMs (LLaMA 3.3 70B, Mistral Large, Qwen 2.5 72B, and Gemini 2.5 Flash) on 263 text-based tasks spanning all 35 skills in the U.S. Department of Labor's O*NET taxonomy (1,052 total model calls, 0% failure rate). Cross-referencing these scores with real-world AI adoption data from the Anthropic Economic Index (756 occupations, 17,998 tasks), we propose an AI Impact Matrix, an interpretive framework that places each skill in one of four quadrants: High Displacement Risk, Upskilling Required, AI-Augmented, and Lower Displacement Risk. Key findings: (1) Mathematics (SAFI: 73.2) and Programming (71.8) receive the highest automation feasibility scores, while Active Listening (42.2) and Reading Comprehension (45.5) receive the lowest; (2) we observe a "capability-demand inversion": the skills most in demand in AI-exposed jobs are those on which LLMs perform worst in our benchmark; (3) 78.7% of observed AI interactions are augmentation rather than automation; (4) all four models converge on similar skill profiles (3.6-point spread), suggesting that text-based automation feasibility may be more skill-dependent than model-dependent. SAFI measures LLM performance on text-based representations of skills, not full occupational execution. All data, code, and model responses are open-sourced.