A new arXiv paper, "Understanding Annotator Safety Policy with Interpretability" by Oesterling, Ren, Assogba, and collaborators, does something deceptively simple: it applies interpretability tools to the humans labeling AI outputs as safe or unsafe, rather than to the models themselves. The finding is uncomfortable. Annotators encode their own subjective readings of policy language into labels, and those encodings are inconsistent in ways that standard inter-annotator agreement scores do not capture.
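
To make the agreement-score point concrete, here is a minimal sketch. It is not the paper's method. It only illustrates, with made-up labels, one way an aggregate score such as Cohen's kappa can hide where disagreement sits. The category names, the label data, and the helper `disagreements_by_category` are all hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)


def disagreements_by_category(categories, a, b):
    """Hypothetical helper: count disagreements per policy category."""
    return Counter(c for c, x, y in zip(categories, a, b) if x != y)


S, U = "safe", "unsafe"
# Made-up data: ten items, five per (hypothetical) policy category.
categories = ["medical"] * 5 + ["violence"] * 5

# Pair 1: the two disagreements are spread across both categories.
spread_a = [S, S, U, U, S, S, S, U, U, U]
spread_b = [S, S, U, U, U, S, S, U, U, S]

# Pair 2: the same number of disagreements, all inside one category.
conc_a = [S, S, U, S, U, S, S, U, U, U]
conc_b = [S, S, U, U, S, S, S, U, U, U]

kappa_spread = cohen_kappa(spread_a, spread_b)
kappa_conc = cohen_kappa(conc_a, conc_b)

print(round(kappa_spread, 3), dict(disagreements_by_category(categories, spread_a, spread_b)))
print(round(kappa_conc, 3), dict(disagreements_by_category(categories, conc_a, conc_b)))
```

Both pairs receive the same kappa, yet in the second pair every disagreement falls in a single category. That is the kind of pattern a single agreement number cannot show, and it would point to a contested reading of one part of the policy.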

The Cloudflare Connection: Who Decides What Work Looks Like

The paper lands the same week Cloudflare eliminated 1,100 roles that AI efficiency supposedly replaced. Many of those were support roles, the kind of jobs that require judgment calls about what a user needs, what counts as a valid complaint, and what constitutes resolution. These are not mechanical tasks. They are annotation tasks in the broad sense: human interpretation of ambiguous situations applied to structured outputs. The irony is that the safety of the AI models replacing these workers depends on annotation labor that faces the same interpretability problems the arXiv paper identifies. Deployment pressure keeps building, including Sam Altman's push for GPT-5.5 adoption as infrastructure. The faster models deploy, the more the annotation bottleneck matters.

Agentic Systems and the Authorization Problem

A second arXiv paper this week, "Partial Evidence Bench" by Krti Tallam, benchmarks what happens when AI agents operate inside scoped retrieval systems with limited authorization. Enterprise agents increasingly make decisions based on partial information, but the benchmarks evaluating them assume complete access. This is a governance gap wearing a technical costume. This week's Artnet story about a Joshua Reynolds painting, in which researchers used new historical analysis to identify an enslaved boy hidden in plain sight, is a useful analog: what you see in a system depends entirely on what you are authorized to look for.
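
A rough sketch of what "scoped retrieval" means in practice, assuming a toy keyword search. It is not the benchmark's design. The `Document` type, the scope names, and the `retrieve` function are hypothetical, chosen only to show how the same query returns different evidence depending on authorization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    doc_id: str
    scope: str  # hypothetical access scope, e.g. "support" or "legal"
    text: str


def retrieve(corpus, query_term, allowed_scopes):
    """Return the matching documents the caller may see, plus how many
    matching documents were withheld because of scope."""
    matches = [d for d in corpus if query_term.lower() in d.text.lower()]
    visible = [d for d in matches if d.scope in allowed_scopes]
    return visible, len(matches) - len(visible)


# Made-up corpus for illustration only.
corpus = [
    Document("d1", "support", "Customer requested a refund for a duplicate charge."),
    Document("d2", "legal", "Refund denied: account flagged for chargeback fraud."),
    Document("d3", "finance", "Refund ledger shows two prior payouts to this account."),
    Document("d4", "support", "Customer asked how to reset a password."),
]

scoped_docs, scoped_withheld = retrieve(corpus, "refund", {"support"})
full_docs, full_withheld = retrieve(corpus, "refund", {"support", "legal", "finance"})

print([d.doc_id for d in scoped_docs], scoped_withheld)
print([d.doc_id for d in full_docs], full_withheld)
```

An agent limited to the support scope sees only the refund request and none of the fraud or ledger context. Whether it can tell that evidence was withheld, as the `withheld` count does here, is a design choice of the retrieval layer and not something an agent can assume.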