Causal graphs reveal how LLMs organize concepts during inference
Researchers map LLM reasoning with causal graphs and counterfactual chains, revealing class-discriminative concept dependencies across diagnosis, sentiment, and judge tasks.

A new preprint describes a method for building causal graphs that expose how large language models organize high-level concepts during inference. The four-phase pipeline discovers interpretable concepts from text examples, maps inputs to LLM-perceived concept states, uses MCMC-inspired counterfactual augmentation to enrich the data and stabilize causal discovery, and then applies a causal discovery algorithm to the augmented data. The resulting graphs show which concepts the model treats as causes and which as effects when producing a prediction.
The authors tested the approach on three LLMs across disease diagnosis, sentiment analysis, and LLM-as-a-judge classification tasks. They evaluated the learned graphs on two criteria: predictive fidelity, meaning how well the graph reproduces the model's outputs, and structural stability under resampling. The counterfactual augmentation procedure expands sparse observational data by generating chains of counterfactuals; according to the paper, these chains converge and the augmented data improves downstream causal discovery with the σ-CG algorithm.
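To make the two evaluation criteria concrete, here is a minimal Python sketch. The summary above does not give the paper's exact metric definitions, so both are assumptions chosen for illustration: fidelity is taken as the plain agreement rate between graph-based and LLM predictions, and stability as the mean pairwise Jaccard similarity of edge sets learned on resampled data. The function names `predictive_fidelity` and `edge_stability` are hypothetical.

```python
from itertools import combinations
from typing import Hashable, Sequence, Set, Tuple

Edge = Tuple[str, str]  # (cause, effect)


def predictive_fidelity(graph_preds: Sequence[Hashable],
                        llm_preds: Sequence[Hashable]) -> float:
    """Fraction of examples where the graph-based prediction matches the LLM's.

    Assumed definition (agreement rate), not necessarily the paper's metric.
    """
    if len(graph_preds) != len(llm_preds):
        raise ValueError("prediction lists must have the same length")
    if not llm_preds:
        raise ValueError("need at least one prediction")
    matches = sum(g == m for g, m in zip(graph_preds, llm_preds))
    return matches / len(llm_preds)


def edge_stability(edge_sets: Sequence[Set[Edge]]) -> float:
    """Mean pairwise Jaccard similarity between edge sets from resampled runs.

    Assumed definition of structural stability, used here for illustration.
    """
    pairs = list(combinations(edge_sets, 2))
    if not pairs:
        raise ValueError("need at least two edge sets to compare")

    def jaccard(a: Set[Edge], b: Set[Edge]) -> float:
        union = a | b
        # Two empty graphs are treated as identical.
        return 1.0 if not union else len(a & b) / len(union)

    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

For example, `predictive_fidelity(["pos", "neg", "pos", "neg"], ["pos", "neg", "neg", "neg"])` gives 0.75, and two resampled graphs that share one of two distinct edges give an `edge_stability` of 0.5.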
Building the causal graph
The pipeline starts by prompting the target LLM to propose class-discriminative concepts for a given task. It then maps each input example to binary concept states as perceived by the LLM. Because observational data alone is too sparse for reliable causal discovery, the method generates counterfactual examples by flipping one concept at a time and asking the LLM to rewrite the input accordingly. An MCMC-inspired procedure chains these flips, producing a richer dataset that captures how concept changes propagate. The final step applies σ-CG, a causal discovery algorithm, to the augmented data, yielding a directed acyclic graph of concept dependencies.
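The chained counterfactual step described above can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the `rewrite` and `perceive` callables stand in for LLM calls, the toy versions below are hypothetical stubs that treat a text as a comma-separated list of concept names, and the proposal rule (flip one concept chosen uniformly at random) is an assumption.

```python
import random
from typing import Callable, Dict, List, Tuple

ConceptState = Dict[str, bool]
CONCEPTS = ["fatigue", "fever"]  # hypothetical concepts for a toy diagnosis task


def toy_rewrite(text: str, concept: str, value: bool) -> str:
    """Stand-in for asking the LLM to rewrite the input with one concept flipped."""
    present = set(filter(None, text.split(",")))
    if value:
        present.add(concept)
        if concept == "fever":
            # Toy dependency: the "model" mentions fatigue whenever it adds fever.
            present.add("fatigue")
    else:
        present.discard(concept)
    return ",".join(sorted(present))


def toy_perceive(text: str) -> ConceptState:
    """Stand-in for asking the LLM which concepts it perceives in the text."""
    present = set(filter(None, text.split(",")))
    return {c: c in present for c in CONCEPTS}


def counterfactual_chain(
    text: str,
    rewrite: Callable[[str, str, bool], str],
    perceive: Callable[[str], ConceptState],
    steps: int,
    rng: random.Random,
) -> List[Tuple[str, ConceptState]]:
    """Chain single-concept flips, recording the text and perceived state at each step."""
    state = perceive(text)
    records = [(text, dict(state))]
    for _ in range(steps):
        concept = rng.choice(sorted(state))       # pick one concept to flip
        text = rewrite(text, concept, not state[concept])
        state = perceive(text)                    # re-read all concepts, not just the flipped one
        records.append((text, dict(state)))
    return records
```

Calling `counterfactual_chain("fatigue", toy_rewrite, toy_perceive, steps=5, rng=random.Random(0))` returns six records. Because every concept is re-read after each rewrite, a flip of one concept can show up as a change in another, which is the propagation signal a causal discovery algorithm would then consume.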
Unlike prior work that uses LLMs to recover causal graphs of external-world processes, this approach treats the LLM itself as the system under study. The authors report that the discovered graphs align with expected reasoning patterns in each of the three domains (disease diagnosis, sentiment classification, and judgment tasks), which they present as a foundation for concept-level explainability. The paper was authored by Nirit Nussbaum-Hoffer, Nitay Calderon, Liat Ein-Dor, and Roi Reichart.


