BM25 lexical search outperforms dense retrievers in agentic research tasks
New research shows a simple lexical retriever can match or outperform dense-retrieval search agents when paired with frontier LLMs, challenging assumptions about agentic search architecture.

Pi-Serini, a search agent from researchers at the University of Waterloo, pairs BM25—a decades-old lexical retriever—with frontier LLMs to answer deep research questions. The system equips the agent with three tools: document retrieval, browsing, and reading. On BrowseComp-Plus, a benchmark for multi-hop research tasks, Pi-Serini with GPT-5.5 achieves 83.1% answer accuracy and 94.7% surfaced evidence recall, outperforming released search agents that rely on dense retrievers.
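The three-tool setup can be sketched as follows. This is a minimal illustration of the interface the article describes, not Pi-Serini's actual API: the function names, signatures, and the toy in-memory corpus are all assumptions.

```python
# Minimal sketch of a retrieval/browse/read tool interface.
# The corpus, ranking heuristic, and function names are illustrative
# assumptions, not Pi-Serini's real implementation.

CORPUS = {
    "doc1": "BM25 is a bag-of-words lexical retrieval function.",
    "doc2": "Dense retrievers map queries and documents to vectors.",
}

def search(query: str, k: int = 2) -> list[str]:
    """Retrieval tool: return ids of the top-k candidate documents.
    Stand-in ranking: count of overlapping query terms."""
    terms = set(query.lower().split())
    ranked = sorted(
        CORPUS,
        key=lambda d: -len(terms & set(CORPUS[d].lower().split())),
    )
    return ranked[:k]

def browse(doc_id: str) -> str:
    """Browsing tool: return a short snippet for quick inspection."""
    return CORPUS[doc_id][:40]

def read(doc_id: str) -> str:
    """Reading tool: return the full document text."""
    return CORPUS[doc_id]

hits = search("lexical retrieval bm25")
snippet = browse(hits[0])
full_text = read(hits[0])
```

In an agent loop, the LLM would decide at each step which of these tools to call, using `search` to gather candidates, `browse` to triage them cheaply, and `read` to extract evidence.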
The findings challenge the assumption that agentic search requires neural embedding models. Controlled ablations show that tuning BM25 parameters improves answer accuracy by 18.0 percentage points and surfaced evidence recall by 11.1 points over the default configuration. Increasing retrieval depth—the number of documents the agent can examine—boosts surfaced evidence recall by 25.3 points compared to shallow retrieval. As LLMs gain stronger reasoning and tool-use abilities, the retriever's job simplifies: it needs only to surface relevant documents at sufficient depth, not rank them perfectly. Source code is available at github.com/justram/pi-serini.