Financial BERT achieves 59% accuracy on numerical triplet sorting with log-magnitude embeddings
A researcher fine-tuned ModernBERT with a custom number-aware tokenizer that encodes magnitudes into 128 bins, improving numerical ordering accuracy from 38% to 59% on triplet-sorting tasks.
Financial BERT is a number-aware embedding model that encodes numerical magnitudes into 128 smoothly interpolated bins instead of treating digits as discrete tokens. The approach addresses a core weakness in standard embedding models: when you compare the cosine similarities of "a 500 hp car", "a 1,200 hp car", and "a 73 hp car", most models, including Qwen and ModernBERT-based embeddings, fail to preserve numerical ordering. The root causes are tokenizer design and the log-likelihood loss used during masked language model pretraining, which rewards exact token prediction over order-of-magnitude reasoning.
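To make the ordering test concrete, here is a minimal sketch of one plausible way to score a triplet. The criterion is an assumption, not the researcher's published metric: a triplet counts as correctly ordered when the sentence with the middle value is more similar to each extreme than the two extremes are to each other. The function names are hypothetical, and the embeddings are plain lists of floats from whatever model is being evaluated.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_is_ordered(values, embeddings):
    """Assumed criterion: the (low, high) pair must be the least similar pair.

    values:     the three numbers that appear in the sentences
    embeddings: the three sentence vectors, in the same order as values
    """
    order = sorted(range(3), key=lambda i: values[i])
    low, mid, high = (embeddings[i] for i in order)
    far = cosine(low, high)
    return cosine(low, mid) > far and cosine(mid, high) > far

def triplet_accuracy(triplets):
    """Fraction of (values, embeddings) triplets that pass the check."""
    hits = sum(triplet_is_ordered(v, e) for v, e in triplets)
    return hits / len(triplets)
```

Under this assumed criterion, embeddings that carry no numeric signal would land near one in three, since any of the three pairs is equally likely to be the least similar.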
The solution uses regex to detect number patterns, then converts each number to its log magnitude and spreads it across 128 embedding bins via linear interpolation. The decoding head is a classification-regression layer with 128 output bins and a smooth cross-entropy loss. After six H100-hours of masked language model fine-tuning on 300 million tokens (including roughly 4 million numbers), the model correctly sorts triplets of sentences 59% of the time, up from 38% for standard ModernBERT with mean pooling and 34% for BGE-base-v1.5 with CLS pooling. The model also extracts structured data from number-heavy HTML tables more reliably than general-purpose embeddings.
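The encoding and loss described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the released implementation: the regex, the log10 range of -6 to 12, the handling of zero and sign, and all function names are hypothetical, and "smooth cross-entropy" is read here as cross-entropy against the interpolated two-bin target.

```python
import math
import re

N_BINS = 128
LOG_MIN, LOG_MAX = -6.0, 12.0  # assumed log10 range; not taken from the source

# Assumed pattern: integers, thousands separators, optional decimals.
NUMBER_RE = re.compile(r"(?<![\w.])\d+(?:,\d{3})*(?:\.\d+)?")

def find_numbers(text):
    """Return the numbers found in text as floats."""
    return [float(m.replace(",", "")) for m in NUMBER_RE.findall(text)]

def soft_bins(value, n_bins=N_BINS, lo=LOG_MIN, hi=LOG_MAX):
    """Spread a number's log magnitude over two adjacent bins."""
    mag = abs(value)  # sign is ignored in this sketch
    log_mag = math.log10(mag) if mag > 0 else lo
    log_mag = min(max(log_mag, lo), hi)  # clamp to the assumed range
    pos = (log_mag - lo) / (hi - lo) * (n_bins - 1)  # fractional bin index
    left = int(math.floor(pos))
    right = min(left + 1, n_bins - 1)
    frac = pos - left
    weights = [0.0] * n_bins
    weights[left] += 1.0 - frac
    weights[right] += frac
    return weights

def decode_bins(weights, lo=LOG_MIN, hi=LOG_MAX):
    """Regression read-out: expected bin index mapped back to a magnitude."""
    center = sum(i * w for i, w in enumerate(weights))
    return 10.0 ** (lo + center / (len(weights) - 1) * (hi - lo))

def soft_cross_entropy(logits, target):
    """Cross-entropy between predicted bin logits and a soft target."""
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))
```

Because the target mass sits on two neighbouring bins, a prediction that lands one bin off is penalized far less than one that is many orders of magnitude away, which a one-hot token loss cannot express. On the input side, the same weights could mix the rows of a 128-row embedding table to form the number's embedding.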
The weights are available on HuggingFace, and a technical blog post details the year-long engineering effort behind the approach.
