RISEI Working Paper + Policy Brief

How (un)Stable Are LLM Occupational Exposure Scores?

Every major forecast about which jobs AI will eliminate comes from asking AI to rate itself. We found the answer depends entirely on which AI you ask. If the measurement is unstable, the policies built on it are too.

Michelle Yin, Hoa Vu, & Claudia Persico

RISEI Working Paper No. 4152026 · April 2026
Northwestern University & American University
JEL: J23, J24, O33, C81

- 3.6× divergence in exposure scores across AI models
- 57% worst-case pairwise agreement (Gemini vs. Claude) on identical tasks
- 2.4× variation in job-loss estimates by model choice
- County-level results flip from "job loss" to "no effect" depending on the model

Key Findings

AI Exposure Scores Are Highly Fragile

Replicating the dominant rubric (Eloundou et al., 2024) with three frontier models on all 18,797 O*NET tasks, we find that mean exposure scores diverge 3.6-fold. One model rated 14% of tasks as directly exposed; another rated 51%. Inter-annotator agreement is weak (Cohen's kappa = 0.36).
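Cohen's kappa measures agreement between two annotators after netting out the agreement expected by chance. A minimal sketch of the statistic, with hypothetical exposure labels (not the paper's data):

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(a) == len(b)
    n = len(a)
    # observed share of items on which the annotators agree
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # agreement expected by chance, given each annotator's own label frequencies
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# hypothetical task labels: "E1" = directly exposed, "E0" = not exposed
model_a = ["E1", "E0", "E1", "E1", "E0", "E0", "E1", "E0"]
model_b = ["E1", "E1", "E0", "E1", "E0", "E1", "E1", "E0"]
print(round(cohen_kappa(model_a, model_b), 2))  # → 0.25
```

Two models can agree on a majority of tasks (here 5 of 8) yet still earn a low kappa once chance agreement is subtracted, which is why headline percentage agreement overstates consistency.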

Downstream Conclusions Flip

In difference-in-differences employment regressions, individual-level coefficient magnitudes vary 2.4-fold across annotators. At the county level, one model shows significant job losses while others show no significant effect. The research conclusion depends on which AI was asked.
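The difference-in-differences contrast behind these regressions compares the employment change in high-exposure (treated) areas to the change in low-exposure (control) areas; because each AI model assigns different exposure scores, each model sorts different counties into the treated group and shifts the estimate. A two-group, two-period sketch with hypothetical numbers (not the paper's estimates):

```python
def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Two-group, two-period difference-in-differences:
    the treated group's change minus the control group's change."""
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# hypothetical mean log employment in high- vs. low-exposure counties
effect = did_estimate(treat_pre=4.60, treat_post=4.52,
                      ctrl_pre=4.58, ctrl_post=4.56)
print(round(effect, 2))  # → -0.06
```

Re-labeling even a few counties from treated to control (as a different annotator's scores would) changes both differences, so the estimated effect can shrink toward zero or lose significance under another model's exposure measure.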

Adoption Drives Capability Measurement

Occupations with higher observed AI usage show significantly larger increases in measured exposure across model generations (coefficient = 0.335, p < 0.05). The measurement instrument evolves with adoption, creating a feedback loop that systematically underrepresents communities with lower AI access.
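The feedback-loop result comes from regressing the change in an occupation's measured exposure across model generations on its observed AI adoption. A minimal least-squares sketch of that functional form, using made-up illustrative numbers (the 0.335 coefficient is the paper's, not reproduced here):

```python
def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var = sum((xi - mx) ** 2 for xi in x)
    return cov / var

# hypothetical data: occupation-level AI adoption share (x) vs.
# change in measured E1 exposure between model generations (y)
adoption = [0.05, 0.10, 0.20, 0.30, 0.45, 0.60]
delta_e1 = [0.01, 0.03, 0.06, 0.10, 0.15, 0.21]
print(round(ols_slope(adoption, delta_e1), 3))  # → 0.36
```

A positive slope here means the instrument drifts with adoption: occupations whose workers already use AI see their measured exposure rise fastest, while low-adoption communities are progressively understated.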

Global Policy Built on Narrow Data

Only 16.3% of the global population has ever used generative AI. The BLS, OECD, ILO, IMF, and WEF all use these scores for employment projections affecting billions. We are making policy for the whole world based on data from 1 in 6 people.

Figure 3: Mean E1 Exposure by Occupation Group and AI Model

Even the broad rank ordering of sectors is not preserved across annotators. Computer and Mathematical occupations are consistently the most exposed, but other groups shift position substantially. Absolute levels diverge even more — ranging from 0.14 to 0.51 for the same occupations.

Figure 3: Mean E1 exposure by SOC major group and annotator, showing 3.6-fold divergence across ChatGPT-4, Gemini 2.5, ChatGPT-5, and Claude 4.5
How to cite this figure:
Yin, M., Vu, H., & Persico, C. (2026). Figure 3: Mean E1 exposure by SOC major group and annotator. In How (un)stable are LLM occupational exposure scores? Evidence from multi-model replication (RISEI Working Paper No. 4152026, p. 32). Northwestern University.

https://michelleyin.org/pub-ai-measurement.html#figure3

Read the Paper & Policy Brief

Working Paper — Yin, Vu & Persico (2026)
Policy Brief No. 2026-03 — Adaptive Precision Framework

The Adaptive Precision Framework

If AI capabilities are a moving target, adoption varies enormously, and the measurement instruments are circular, what is the alternative? The companion policy brief proposes Adaptive Precision: using AI-enabled real-time data to continuously recalibrate what we teach, how we hire, how we design jobs, and how we deliver services.

Personalized Learning

Curricula adjusted each semester based on sector-specific adoption data.

Personalized Hiring

Assessment rubrics recalibrated as new populations adopt AI tools.

Personalized Job Design

Task bundles restructured continuously as AI capabilities evolve.

Personalized Services

Workforce development tailored to each community's adoption landscape.
