Qwen3-8B reaches 77% on FHIR clinical reasoning with reinforcement learning post-training
Researchers applied reinforcement learning to a multi-turn CodeAct agent built on Qwen3-8B, improving accuracy on FHIR-AgentBench from 50% to 77% and outperforming closed models on structured healthcare graph queries.
A new arXiv preprint demonstrates that reinforcement learning can sharply improve how language models reason over Fast Healthcare Interoperability Resources (FHIR), the dominant standard for exchanging electronic health records. Researchers framed clinical question answering as a sequential decision problem over a queryable graph and showed that a smaller open-weight model, post-trained with RL, beats prompt-based closed models on real-world hospital data. The work centers on FHIR-AgentBench, a benchmark that requires agents to perform multi-step filtering, traversal, and aggregation across multiple resource types in a directed graph of patient records.
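The kind of query the benchmark poses can be sketched as a multi-step program over a toy record graph. This is an illustrative example, not the paper's environment: the resource types and reference fields (`Observation.subject`, `Condition.subject`) follow the FHIR standard, but the patient data and the query are invented.

```python
# Toy FHIR-like record store: each resource references others by id,
# forming a directed graph over resource types.
resources = {
    "Patient/1": {"resourceType": "Patient", "gender": "female"},
    "Patient/2": {"resourceType": "Patient", "gender": "male"},
    "Condition/10": {"resourceType": "Condition", "code": "diabetes",
                     "subject": "Patient/1"},
    "Observation/20": {"resourceType": "Observation", "code": "hba1c",
                       "value": 8.1, "subject": "Patient/1"},
    "Observation/21": {"resourceType": "Observation", "code": "hba1c",
                       "value": 6.9, "subject": "Patient/1"},
    "Observation/22": {"resourceType": "Observation", "code": "hba1c",
                       "value": 5.2, "subject": "Patient/2"},
}

def answer_query():
    """Multi-step query: filter -> traverse -> aggregate.
    'What is the highest HbA1c observed for patients with diabetes?'"""
    # 1. Filter: patients referenced by a diabetes Condition.
    diabetic = {r["subject"] for r in resources.values()
                if r["resourceType"] == "Condition" and r["code"] == "diabetes"}
    # 2. Traverse: follow Observation.subject references back to those patients.
    values = [r["value"] for r in resources.values()
              if r["resourceType"] == "Observation"
              and r["code"] == "hba1c" and r["subject"] in diabetic]
    # 3. Aggregate over the matched observations.
    return max(values)

print(answer_query())  # 8.1
```

Even this three-hop toy case shows why single-shot prompting struggles: the agent must pick the right resource types, follow references in the right direction, and aggregate only over the filtered subset.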
The team implemented a multi-turn CodeAct agent on Qwen3-8B and post-trained it using a custom RL harness with execution-grounded rewards from an LLM judge. Prior tool-augmented agents often selected the wrong resources or violated traversal constraints; the RL approach enforces data-integrity rules during training. On FHIR-AgentBench, the post-trained Qwen3-8B reached 77% answer correctness, surpassing o4-mini at 50% along with the other closed-model baselines, while using fewer parameters and incurring lower inference cost. The paper presents an end-to-end pipeline covering environment construction, harness design, model training, and custom evaluation.
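The shape of such a harness can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `stub_policy` stands in for Qwen3-8B, and exact-match scoring stands in for the paper's LLM-judge reward; the `ANSWER:` convention and the tiny record store are assumptions for the sketch.

```python
# Multi-turn CodeAct loop with an execution-grounded terminal reward:
# the agent alternates code actions and observations, then commits an answer.
import io
import contextlib

def run_code(code: str, env: dict) -> str:
    """Execute a code action; captured stdout becomes the next observation."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)
    except Exception as e:
        return f"Error: {e}"
    return buf.getvalue()

def rollout(policy, question: str, gold: str, env: dict, max_turns: int = 4) -> float:
    """One episode: reward depends on the final answer, not on heuristics."""
    history = [question]
    for _ in range(max_turns):
        action = policy(history)
        if action.startswith("ANSWER:"):
            answer = action[len("ANSWER:"):].strip()
            return 1.0 if answer == gold else 0.0  # execution-grounded reward
        history.append(run_code(action, env))      # observation fed back to the model
    return 0.0  # episode ends unrewarded if the agent never commits

def stub_policy(history):
    """Hypothetical policy: inspect the record store once, then answer."""
    if len(history) == 1:
        return "print(len(records))"  # turn 1: emit code to explore the data
    return "ANSWER: 2"                # turn 2: commit a final answer

env = {"records": {"Patient/1": {}, "Patient/2": {}}}
print(rollout(stub_policy, "How many patients are there?", "2", env))  # 1.0
```

In training, many such rollouts are sampled and the terminal rewards drive a policy-gradient update; the sketch shows only the environment side of that loop.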
FHIR graphs are notoriously complex: clinical queries can span Observation, Condition, MedicationRequest, and Procedure resources, each with its own schema and referential links. The sequential decision framing lets the agent learn which traversal paths yield correct answers without violating schema constraints, a problem that purely prompt-based systems struggle to solve. The RL reward signal is grounded in whether the agent's generated code executes correctly and returns the right answer, rather than relying on heuristic scoring.
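One way an environment can enforce such schema constraints is a whitelist of legal reference hops. The table below uses a few real FHIR R4 reference fields but is illustrative and far from exhaustive; the `valid_path` helper is a hypothetical sketch, not the paper's mechanism.

```python
# Each FHIR resource type only permits certain outgoing reference fields,
# so a traversal path is valid only if every hop uses an allowed link.
ALLOWED_LINKS = {
    ("Observation", "subject"): "Patient",
    ("Condition", "subject"): "Patient",
    ("MedicationRequest", "subject"): "Patient",
    ("Procedure", "subject"): "Patient",
    ("Observation", "encounter"): "Encounter",
}

def valid_path(path) -> bool:
    """Check a traversal path given as (source_type, field, target_type) hops."""
    return all(ALLOWED_LINKS.get((src, field)) == dst
               for src, field, dst in path)

valid_path([("Observation", "subject", "Patient")])    # True: legal reference
valid_path([("Condition", "subject", "Observation")])  # False: illegal hop
```

Rejecting illegal hops at execution time means the reward signal never credits a trajectory that reached an answer by violating the schema.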
The next step is scaling the approach to larger FHIR datasets and testing generalization across hospital systems with different resource-type distributions. The paper does not yet report how the agent handles out-of-distribution clinical questions or whether the RL policy transfers to non-FHIR structured graphs. If the method holds up on broader EHR schemas, it could become a standard post-training recipe for any domain where LLMs must reason over constrained, multi-hop data structures.
