LEAF living benchmark tracks LLM forecasting on event-driven stock predictions
Researchers released LEAF, the first living benchmark for event-augmented forecasting. It uses recursive retrieval and dual-agent validation to supply LLMs with multidimensional event data for predicting stock trends, event probabilities, and time series.
LEAF is a living benchmark that tests how well large language models forecast real-world outcomes when fed complex event streams. Released on arXiv this week, the preprint addresses a gap in existing LLM forecasting benchmarks: most either lack the multidimensional event data that drives accurate predictions or operate in closed, simplified environments. LEAF tackles both problems by pairing a recursive retrieval agent with dual-agent cross-validation, which together supply comprehensive, relevant auxiliary text for three kinds of forecasting task: future event probabilities, trends, and time series.
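To make the retrieve-then-validate idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the preprint's implementation: the names `recursive_retrieve`, `search`, `expand`, and `validators` are hypothetical, and the toy keyword checks stand in for what would be LLM-based agents in a real pipeline. An event is kept only when both validators accept it, and each accepted event can trigger follow-up queries up to a depth limit.

```python
from typing import Callable, List, Set

# Hypothetical signatures, chosen for illustration only.
Search = Callable[[str], List[str]]          # query -> candidate event texts
Expand = Callable[[str], List[str]]          # accepted event -> follow-up queries
Validator = Callable[[str, str], bool]       # (target query, event) -> keep?


def recursive_retrieve(
    query: str,
    search: Search,
    validators: List[Validator],
    expand: Expand,
    max_depth: int = 2,
) -> List[str]:
    """Collect validated events for `query`, recursing on follow-up queries."""
    accepted: List[str] = []
    seen_queries: Set[str] = set()
    seen_events: Set[str] = set()

    def visit(q: str, depth: int) -> None:
        # Stop at the depth limit and never repeat a query.
        if depth > max_depth or q in seen_queries:
            return
        seen_queries.add(q)
        for event in search(q):
            if event in seen_events:
                continue
            seen_events.add(event)
            # Cross-validation: every validator must accept the event.
            if all(v(query, event) for v in validators):
                accepted.append(event)
                for follow_up in expand(event):
                    visit(follow_up, depth + 1)

    visit(query, 0)
    return accepted


# Toy usage with an in-memory corpus (all data here is made up).
corpus = {
    "ACME stock": ["ACME recalls widgets", "Celebrity gossip"],
    "widget recall": ["Regulator opens ACME probe"],
}
events = recursive_retrieve(
    "ACME stock",
    search=lambda q: corpus.get(q, []),
    validators=[
        lambda target, e: "ACME" in e,           # stand-in relevance check
        lambda target, e: len(e.split()) >= 3,   # stand-in quality check
    ],
    expand=lambda e: ["widget recall"] if "recalls" in e else [],
)
print(events)  # ['ACME recalls widgets', 'Regulator opens ACME probe']
```

Requiring agreement between two independent checks trades some recall for precision, so less noise reaches the forecasting model.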
Evaluating state-of-the-art proprietary and open-weight models across stock market prediction tasks, the researchers found that LLMs can extract signals from complex events to improve forecasting performance. Models perform better on equities they self-assess as more predictable, and the retrieved events correlate strongly with the target equities, which suggests the retrieval pipeline surfaces meaningful context rather than noise. Because LEAF updates dynamically, it sidesteps pre-training contamination, a persistent problem when benchmarking models on static datasets they may have already seen during training.
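The contamination argument can be illustrated with a short Python sketch. The article does not say how LEAF selects its evaluation items, so this shows only the general principle: a forecast target dated after a model's training cutoff cannot have appeared in its training data. `ForecastItem`, `uncontaminated`, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class ForecastItem:
    ticker: str        # hypothetical field: the equity being forecast
    event_date: date   # date the outcome becomes known


def uncontaminated(items: List[ForecastItem], training_cutoff: date) -> List[ForecastItem]:
    """Keep only items whose outcome postdates the model's training cutoff."""
    return [item for item in items if item.event_date > training_cutoff]


# Made-up example data.
items = [
    ForecastItem("ACME", date(2023, 6, 1)),
    ForecastItem("ACME", date(2025, 3, 1)),
]
fresh = uncontaminated(items, training_cutoff=date(2024, 12, 31))
print([item.event_date.isoformat() for item in fresh])  # ['2025-03-01']
```

A static dataset shrinks under this filter as newer models move their cutoffs forward, whereas a benchmark that keeps adding recent items always has fresh targets available.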
The recursive retrieval design is meant to scale with new data sources, and the dual-agent validation step filters out irrelevant or low-quality event text before it reaches the forecasting model. The stock domain is the initial testbed, but the architecture could extend to other forecasting domains where event context matters, such as geopolitics, supply chains, and public health. Future work should clarify which event types contribute most to forecast accuracy and whether open-weight models can close the gap with proprietary systems when both have access to the same event augmentation.
