TransitLM: 13M transit records train LLMs to route without maps
A 13-million-record corpus from four Chinese cities lets language models generate bus and subway routes end-to-end, grounding GPS coordinates to stations with no explicit mapping layer.
TransitLM is a transit route planning dataset covering 120,845 stations and 13,666 lines across four Chinese cities. The corpus includes over 13 million route planning records formatted for continual pre-training of large language models, alongside benchmark tasks that test whether a model can produce structurally valid routes from origin-destination pairs. Experiments show that an LLM trained on TransitLM generates accurate multi-leg transit itineraries and implicitly maps arbitrary GPS coordinates to the correct stations without any structured map data or routing engine.
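For contrast, the explicit coordinate-to-station step that the model is reported to learn implicitly can be sketched as a nearest-station lookup. This is a minimal illustration and not part of TransitLM: the station names, coordinates, and the `nearest_station` helper are all hypothetical, and a production system would also account for walking paths and line availability.

```python
import math

EARTH_RADIUS_KM = 6371.0

# Hypothetical stations (name -> (latitude, longitude)); not taken from the dataset.
STATIONS = {
    "Station A": (39.90, 116.40),
    "Station B": (39.95, 116.45),
    "Station C": (39.80, 116.30),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(lat, lon, stations):
    """Return the name of the station closest to the given coordinate."""
    return min(stations, key=lambda name: haversine_km(lat, lon, *stations[name]))


if __name__ == "__main__":
    print(nearest_station(39.91, 116.41, STATIONS))
```

A conventional planner would run a lookup like this for both origin and destination before pathfinding; the claim here is that the trained model produces the same grounding without such a table.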
Traditional transit planning systems rely on graph databases, GTFS feeds, and complex pathfinding algorithms that require careful maintenance of station metadata, line schedules, and transfer rules. TransitLM bypasses that stack entirely: the model learns station names, line connections, and coordinate-to-station grounding purely from the training data. The authors structured the corpus as both a continual pre-training dataset and a benchmark suite with three evaluation tasks: exact route match, station sequence accuracy, and coordinate grounding precision.
The coordinate grounding capability is particularly notable. Given an arbitrary pair of origin and destination GPS coordinates, the trained LLM identifies the nearest appropriate stations and generates a valid route between them without ever seeing an explicit lat-lon-to-station lookup table. This suggests that large language models can internalize spatial relationships from textual route descriptions alone. The dataset and evaluation code, released May 22, 2026, are available at huggingface.co/datasets/GD-ML/TransitLM and github.com/HotTricker/TransitLM.
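The three evaluation tasks are named here but not defined, so the sketch below shows one plausible reading of each rather than the benchmark's actual scoring code. The function names, the representation of a route as a list of station names, and the position-wise scoring rule are all assumptions made for illustration.

```python
from typing import Sequence


def exact_route_match(pred: Sequence[str], ref: Sequence[str]) -> bool:
    """True only if the predicted station sequence equals the reference exactly."""
    return list(pred) == list(ref)


def station_sequence_accuracy(pred: Sequence[str], ref: Sequence[str]) -> float:
    """Assumed definition: position-wise matches divided by the longer length."""
    if not pred and not ref:
        return 1.0
    hits = sum(p == r for p, r in zip(pred, ref))
    return hits / max(len(pred), len(ref))


def grounding_precision(pred_stations: Sequence[str], ref_stations: Sequence[str]) -> float:
    """Assumed definition: share of coordinate queries mapped to the reference station."""
    if not ref_stations:
        return 0.0
    hits = sum(p == r for p, r in zip(pred_stations, ref_stations))
    return hits / len(ref_stations)


if __name__ == "__main__":
    # Hypothetical routes written as station-name sequences.
    predicted = ["A", "B", "C"]
    reference = ["A", "B", "D"]
    print(exact_route_match(predicted, reference))
    print(station_sequence_accuracy(predicted, reference))
    print(grounding_precision(["A", "C"], ["A", "B"]))
```

Exact match is the strictest of the three, since a single wrong station fails the whole route, while the sequence score gives partial credit; the repository at github.com/HotTricker/TransitLM is the authority on how the benchmark actually scores outputs.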
