RealICU benchmark exposes LLM failure modes in ICU decision support
New benchmark tests language models on intensive-care data with hindsight labels from senior physicians, revealing a recall-safety tradeoff and an anchoring bias that persist even in memory-augmented agents.
RealICU is a hindsight-annotated benchmark for evaluating large language models on intensive care unit data, released this week by researchers including Chengzhi Shen, Weixiang Shen, and Tobias Susetzky. The benchmark differs from prior ICU datasets by labeling patient trajectories after senior physicians review the full outcome, rather than treating real-time clinician actions as ground truth. The team argues that historical ICU actions are made under incomplete information and time pressure, making them suboptimal references for measuring AI reasoning. RealICU partitions patient data into 30-minute windows and ships two datasets: RealICU-Gold, with 930 annotated windows from 94 MIMIC-IV patients, and RealICU-Scale, which extends coverage to 11,862 windows using Oracle, a physician-validated LLM hindsight labeler. The four evaluation tasks are Patient Status assessment, Acute Problems identification, Recommended Actions, and Red Flags, which mark actions that risk unsafe outcomes.
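To make the setup concrete, here is a minimal Python sketch of what one evaluation window could look like. The schema and field names are illustrative assumptions, not the published data format; the released datasets define the actual layout.

```python
from dataclasses import dataclass

# Hypothetical record for one 30-minute RealICU window. Field names are
# illustrative assumptions; the released datasets define the real schema.
@dataclass
class ICUWindow:
    patient_id: str                 # MIMIC-IV patient identifier
    window_start: str               # ISO timestamp; each window spans 30 minutes
    observations: dict              # vitals, labs, and notes seen in this window
    # Hindsight labels, assigned after senior physicians reviewed the outcome:
    patient_status: str             # Patient Status assessment
    acute_problems: list[str]       # Acute Problems identified in hindsight
    recommended_actions: list[str]  # Recommended Actions
    red_flags: set[str]             # actions flagged as risking unsafe outcomes
```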
Existing LLMs, including memory-augmented architectures, performed poorly on the benchmark. The preprint documents two failure modes: a recall-safety tradeoff in which models either miss critical interventions or flag too many false alarms, and an anchoring bias in which early interpretations of the patient state persist even when new data should trigger reassessment. The authors introduce ICU-Evo, a structured-memory agent designed to improve long-horizon reasoning over evolving clinical streams. ICU-Evo reduced but did not eliminate safety failures, suggesting that current architectures still struggle with the temporal reassessment loop that human intensivists perform.
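ICU-Evo's actual design is detailed in the preprint; as a rough illustration of the general idea, the sketch below shows a generic structured-memory loop that forces an explicit reassessment step at each window, which is exactly what anchoring bias short-circuits. The `llm` callable and the prompts are assumptions for illustration, not the authors' implementation.

```python
# Generic structured-memory agent loop, in the spirit of (but not identical
# to) ICU-Evo: the agent carries an explicit working summary forward and must
# reconcile it against each new window, rather than letting an early
# interpretation anchor all later outputs.
# `llm` is a hypothetical callable mapping a prompt string to generated text.

def run_agent(llm, windows):
    memory = "No prior assessment."  # structured summary carried across windows
    outputs = []
    for w in windows:
        # Step 1: ask the model what in the new window contradicts memory.
        conflicts = llm(
            f"Prior assessment:\n{memory}\n\n"
            f"New 30-minute window:\n{w.observations}\n\n"
            "List findings that contradict or update the prior assessment."
        )
        # Step 2: revise the summary before producing task outputs, so
        # reassessment is an explicit step instead of being left implicit
        # in one long prompt.
        memory = llm(
            f"Prior assessment:\n{memory}\nConflicts:\n{conflicts}\n"
            "Write a revised patient assessment."
        )
        # Step 3: produce the four task outputs from the revised summary.
        outputs.append(llm(
            f"Assessment:\n{memory}\n"
            "Give Patient Status, Acute Problems, Recommended Actions, "
            "and Red Flags for this window."
        ))
    return outputs
```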
The benchmark is built on MIMIC-IV, a public critical-care database, and the team has published a project page with dataset access and evaluation scripts. The hindsight-labeling approach is resource-intensive, requiring senior physician review of all 930 Gold windows, but the Oracle extension to 11,862 windows makes the benchmark large enough for model training and ablation studies. The recall-safety tradeoff is particularly acute in the Red Flag task, where false negatives can delay life-saving interventions and false positives can trigger alert fatigue.
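That tradeoff can be made precise with two numbers per evaluation run, sketched below under the assumption that predictions and hindsight labels are parallel lists of per-window flag sets. This is a toy scoring scheme for illustration, not the benchmark's official metric.

```python
def red_flag_tradeoff(predictions, labels):
    """Compute recall and false-alarm rate for a Red Flag-style task.

    `predictions` and `labels` are parallel lists of per-window flag sets.
    Both numbers matter: low recall means missed dangerous actions, while a
    high false-alarm rate feeds alert fatigue. (Toy metric, illustrative only.)
    """
    tp = fn = fp = 0
    for pred, gold in zip(predictions, labels):
        tp += len(pred & gold)   # correctly flagged actions
        fn += len(gold - pred)   # missed red flags (delayed interventions)
        fp += len(pred - gold)   # spurious alarms (alert fatigue)
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    false_alarms_per_window = fp / len(labels) if labels else 0.0
    return recall, false_alarms_per_window
```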
The next step is scaling hindsight annotation beyond MIMIC-IV to multi-site ICU data, which would test whether the failure modes generalize across hospital systems and patient populations. The authors note that ICU-Evo's structured memory reduces anchoring bias but does not solve it, leaving open the question of whether retrieval-augmented architectures or explicit belief-revision modules can close the gap. Until then, RealICU provides a stress test for any LLM claiming readiness for high-stakes clinical decision support.
