LLM annotation errors resist correction: study finds only 34.8% fixable via prompting | UncensoredHub

New research finds only 34.8% of LLM annotation mistakes can be fixed by adding context to prompts, with high-confidence errors proving especially sticky across toxicity detection tasks.

By Alex Sokoloff · June 13, 2026

A preprint posted to HuggingFace this week shows that large language models struggle to correct their own annotation mistakes even when given additional context. Researchers at Caltech and the Allen Institute tested dense and mixture-of-experts models on toxicity detection across social media, gaming, news, and forum datasets. Nearly two-thirds of zero-shot errors proved resistant to correction through prompt engineering, with an overall rescue rate of just 34.8%.
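To make the headline number concrete, here is a minimal sketch of how a rescue rate could be computed. It assumes the metric is the share of zero-shot errors that become correct once context is added to the prompt; the preprint's exact definition may differ, and the function name `rescue_rate` and the toy labels are invented for illustration.

```python
def rescue_rate(gold, zero_shot, prompted):
    """Fraction of zero-shot errors that the context-augmented prompt corrects.

    Assumed definition (not taken from the preprint): only items the model
    got wrong in the zero-shot run count toward the denominator.
    """
    if not (len(gold) == len(zero_shot) == len(prompted)):
        raise ValueError("all label lists must have the same length")
    # Indices where the zero-shot annotation disagreed with the gold label.
    errors = [i for i, (g, z) in enumerate(zip(gold, zero_shot)) if g != z]
    if not errors:
        return float("nan")  # nothing to rescue
    # An error is "rescued" if the prompted run now matches the gold label.
    rescued = sum(1 for i in errors if prompted[i] == gold[i])
    return rescued / len(errors)


# Toy labels (1 = toxic, 0 = non-toxic), made up for this example.
gold      = [1, 0, 1, 1, 0, 0]
zero_shot = [0, 0, 0, 1, 1, 0]  # three errors: items 0, 2 and 4
prompted  = [1, 0, 0, 1, 1, 1]  # item 0 is rescued; item 5 is a new error
print(rescue_rate(gold, zero_shot, prompted))  # 1 of 3 errors rescued
```

Note that under this definition, new mistakes introduced by the longer prompt (item 5 above) do not lower the rescue rate, so it is worth tracking them separately.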

The team introduced a metric called Definition-Specific Familiarity (DSF), which measures how well a model's internal concept aligns with the task definition provided in the prompt. After controlling for dataset-level confounds, DSF showed a positive association with performance (partial r = +0.41). Three separate memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all failed to show a positive relationship with accuracy. The finding suggests that raw text-level memorization of training data matters less than whether the model's learned concept matches the user's intended definition.
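For readers unfamiliar with partial correlations, the sketch below shows one common way to control for a dataset-level confound: subtract each dataset's mean from both variables, then correlate what is left. This is an assumption about the general technique, not the authors' actual procedure, and the function name `partial_corr_given_group` and the numbers are invented for illustration.

```python
import numpy as np


def partial_corr_given_group(x, y, groups):
    """Pearson correlation of x and y after removing per-group means.

    Demeaning within each group is equivalent to regressing both variables
    on group indicator variables and correlating the residuals.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    groups = np.asarray(groups)
    x_res = x.copy()
    y_res = y.copy()
    for g in np.unique(groups):
        mask = groups == g
        x_res[mask] -= x[mask].mean()
        y_res[mask] -= y[mask].mean()
    return float(np.corrcoef(x_res, y_res)[0, 1])


# Hypothetical scores: within each dataset, higher DSF goes with higher
# accuracy, but dataset "B" is harder overall and has higher DSF values.
dsf      = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
accuracy = [0.80, 0.85, 0.90, 0.50, 0.55, 0.60]
dataset  = ["A", "A", "A", "B", "B", "B"]

pooled = float(np.corrcoef(dsf, accuracy)[0, 1])
partial = partial_corr_given_group(dsf, accuracy, dataset)
print(f"pooled r = {pooled:+.2f}, partial r = {partial:+.2f}")
```

In this toy data the pooled correlation is negative while the within-dataset correlation is positive, which is why a dataset-level control can change the sign of an apparent relationship.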

Decision stickiness

High-confidence errors were especially resistant to correction. When models made a mistake with high certainty in the zero-shot setting, adding clarifying information to the prompt rarely changed the output. The paper also tested what happens when researchers deliberately provide misaligned task definitions, that is, instructions that conflict with standard usage. Models followed the misaligned definitions while maintaining the same confidence levels as when given aligned definitions, indicating they don't signal uncertainty when asked to apply unfamiliar criteria.

The experiments spanned multiple toxicity detection datasets, each with different norms and edge cases. A model might correctly flag slurs in one domain but miss context-dependent toxicity in another, and prompting with examples from the missed cases often failed to shift the boundary. The authors argue this "decision stickiness" limits the reliability of LLM-as-a-judge workflows, especially in domains where the model's training distribution diverges from the annotation task's ground truth.
