LLM annotation errors resist correction: study finds only 34.8% fixable via prompting
New research finds only 34.8% of LLM annotation mistakes can be fixed by adding context to prompts, with high-confidence errors proving especially sticky across toxicity detection tasks.

A preprint posted to Hugging Face this week shows that large language models struggle to correct their own annotation mistakes even when given additional context. Researchers at Caltech and the Allen Institute tested dense and mixture-of-experts models on toxicity detection across social media, gaming, news, and forum datasets. Nearly two-thirds of zero-shot errors resisted correction through prompt engineering: the overall rescue rate, meaning the share of zero-shot errors fixed once context was added to the prompt, was just 34.8%.
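For readers who want to track the same number on their own annotation runs, the rescue rate can be sketched in a few lines. This is a minimal illustration, assuming the metric is simply "zero-shot errors that become correct after context is added, divided by all zero-shot errors"; the function name `rescue_rate` and the toy labels are hypothetical and not taken from the paper.

```python
def rescue_rate(gold, zero_shot, with_context):
    """Fraction of zero-shot errors that become correct once context is added.

    Assumed definition for illustration; the paper's exact formula may differ.
    """
    # Positions where the zero-shot label disagreed with the gold label.
    errors = [i for i, (g, z) in enumerate(zip(gold, zero_shot)) if g != z]
    if not errors:
        return None  # nothing to rescue
    # An error counts as rescued if the context-augmented label matches gold.
    rescued = sum(1 for i in errors if with_context[i] == gold[i])
    return rescued / len(errors)


# Toy labels (1 = toxic, 0 = not toxic); made up for this example.
gold         = [1, 0, 1, 1, 0, 0, 1, 0]
zero_shot    = [0, 0, 0, 1, 1, 0, 0, 1]
with_context = [1, 0, 0, 1, 1, 0, 0, 0]

rate = rescue_rate(gold, zero_shot, with_context)
print(f"rescue rate: {rate:.1%}")
```

In the toy data, five of the eight zero-shot labels are wrong and two of those five flip to the correct label with context, so the sketch reports a rescue rate of 40%.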
The team introduced a metric called Definition-Specific Familiarity (DSF), which measures how well a model's internal concept aligns with the task definition provided in the prompt. After controlling for dataset-level confounds, DSF showed a positive association with performance (partial r = +0.41). Three separate memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all failed to show a positive relationship with accuracy. The finding suggests that raw text-level memorization of training data matters less than whether the model's learned concept matches the user's intended definition.
Decision stickiness
High-confidence errors were especially resistant to correction. When models made a mistake with high certainty in the zero-shot setting, adding clarifying information to the prompt rarely changed the output. The paper also tested what happens when researchers deliberately provide misaligned task definitions, meaning instructions that conflict with standard usage. Models followed the misaligned definitions while reporting the same confidence levels as when given aligned definitions, indicating that they don't signal uncertainty when asked to apply unfamiliar criteria.
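A practitioner could check for the same stickiness pattern by splitting zero-shot errors by confidence and comparing how often each group is corrected. The sketch below assumes each record carries a gold label, a zero-shot label with a confidence score, and a label produced with added context. The field names, the 0.9 cutoff, and the toy records are all hypothetical choices for illustration, not the paper's setup.

```python
def rescue_by_confidence(records, threshold=0.9):
    """Rescue rate for zero-shot errors, split into high and low confidence.

    `threshold` is an arbitrary illustrative cutoff, not a value from the paper.
    """
    counts = {"high": [0, 0], "low": [0, 0]}  # bucket -> [rescued, errors]
    for r in records:
        if r["zero_shot"] == r["gold"]:
            continue  # only zero-shot errors are of interest
        bucket = "high" if r["zero_shot_conf"] >= threshold else "low"
        counts[bucket][1] += 1
        if r["with_context"] == r["gold"]:
            counts[bucket][0] += 1
    # Return None for a bucket that contains no errors.
    return {b: (res / tot if tot else None) for b, (res, tot) in counts.items()}


# Toy records (1 = toxic, 0 = not toxic); made up for this example.
records = [
    {"gold": 1, "zero_shot": 0, "zero_shot_conf": 0.97, "with_context": 0},
    {"gold": 0, "zero_shot": 1, "zero_shot_conf": 0.95, "with_context": 1},
    {"gold": 1, "zero_shot": 0, "zero_shot_conf": 0.92, "with_context": 1},
    {"gold": 1, "zero_shot": 0, "zero_shot_conf": 0.60, "with_context": 1},
    {"gold": 0, "zero_shot": 1, "zero_shot_conf": 0.55, "with_context": 0},
    {"gold": 0, "zero_shot": 1, "zero_shot_conf": 0.70, "with_context": 1},
    {"gold": 1, "zero_shot": 1, "zero_shot_conf": 0.99, "with_context": 1},
]

rates = rescue_by_confidence(records)
print(rates)
```

A markedly lower rate in the "high" bucket than in the "low" bucket would match the pattern the paper describes. The toy records are constructed to show that gap.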
The experiments spanned multiple toxicity detection datasets, each with different norms and edge cases. A model might correctly flag slurs in one domain but miss context-dependent toxicity in another, and prompting with examples from the missed cases often failed to shift the boundary. The authors argue this "decision stickiness" limits the reliability of LLM-as-a-judge workflows, especially in domains where the model's training distribution diverges from the annotation task's ground truth.






