Music recommender lifts recall up to 95% by fusing audio, lyrics, and LLM metadata
A new framework enriches LastFM-1K with audio embeddings, lyric vectors, and LLM-generated semantic tags, lifting session-based music recommendation recall by up to 95 percent over ID-only baselines.

A team of researchers has released a multimodal music recommendation framework that treats songs as rich content bundles rather than opaque interaction tokens. The approach, detailed in a preprint published this week, augments the LastFM-1K dataset with audio embeddings, lyric representations, LLM-generated semantic metadata, and listening completion ratios, then feeds those signals into an extended version of the E4SRec sequential recommendation architecture. Experiments show recall improvements of up to 95 percent and NDCG gains of up to 79 percent over ID-only baselines.
Most music recommenders rely on collaborative filtering (user A played songs B and C, user D also played B, so recommend C to user D), ignoring what a song actually sounds like or says. Prior work has explored LLM-augmented and text-enhanced sequential recommendation, but the authors say none has jointly modeled acoustic content, lyric semantics, and engagement signals within a unified LLM-based reasoning framework. The new system extracts audio and lyric embeddings using pretrained music and text representation models, generates semantic metadata via the MGPHot annotation schema (prompting an LLM to tag mood, genre, and instrumentation), and logs listening completion ratios as a proxy for user satisfaction. Those features are concatenated and passed to item-ID encoder backbones including SASRec, BERT4Rec, and GRU4Rec, with LLM backbone options spanning LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in both zero-shot and fine-tuned settings.
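To make the fusion step concrete, here is a minimal Python sketch of concatenating per-track signals into one feature vector. It is an illustration under stated assumptions, not the authors' code: the `TrackFeatures` class, its field names, the per-modality L2 normalization, and the clipping of the completion ratio to the range 0 to 1 are all hypothetical choices that the article does not describe.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TrackFeatures:
    """Hypothetical content bundle for one listening event (names are illustrative)."""
    audio: List[float]        # embedding from a pretrained music model
    lyrics: List[float]       # embedding from a pretrained text model
    tags: List[float]         # encoded LLM-generated mood/genre/instrumentation tags
    played_seconds: float     # how long the user listened
    duration_seconds: float   # full track length


def completion_ratio(played_seconds: float, duration_seconds: float) -> float:
    """Fraction of the track that was heard, clipped to [0, 1] (an assumption)."""
    if duration_seconds <= 0:
        return 0.0
    return max(0.0, min(1.0, played_seconds / duration_seconds))


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so no modality dominates by magnitude."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(track: TrackFeatures) -> List[float]:
    """Naive fusion: normalize each modality, then concatenate everything."""
    return (
        l2_normalize(track.audio)
        + l2_normalize(track.lyrics)
        + l2_normalize(track.tags)
        + [completion_ratio(track.played_seconds, track.duration_seconds)]
    )


# Toy vectors; real embeddings would have hundreds of dimensions.
track = TrackFeatures(
    audio=[3.0, 4.0],
    lyrics=[0.0, 2.0, 0.0],
    tags=[1.0, 0.0],
    played_seconds=90.0,
    duration_seconds=180.0,
)
fused = fuse(track)
print(len(fused), fused)
```

The fused vector has one slot per input dimension plus one for the completion ratio. Plain concatenation like this treats every signal as equally informative, which is the kind of naive fusion the benchmark results caution against.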
On benchmarks
The authors tested on LastFM-1K, a session-based listening dataset. Integrating content-based features lifted recall by as much as 95 percent and NDCG by as much as 79 percent compared to ID-only models. However, naive multimodal fusion, in which all embeddings are simply concatenated, did not always yield additive improvements; some feature combinations degraded performance, highlighting cross-modal integration challenges. The team notes that acoustic and lyric embeddings often capture overlapping information, and that careful feature selection and fusion strategies matter more than throwing every signal into the mixer.
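For readers unfamiliar with the two metrics, the sketch below shows how recall and NDCG are commonly computed in next-item evaluation, and how a relative gain is derived from two scores. It assumes a single held-out target track per session and a top-k cutoff; the article does not state the paper's exact protocol or value of k, and the numbers in the example are illustrative, not results from the preprint.

```python
import math
from typing import Sequence


def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """1.0 if the held-out next track appears in the top k, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """With one relevant item the ideal DCG is 1, so NDCG = 1 / log2(rank + 1)."""
    for position, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / math.log2(position + 1)
    return 0.0


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Percentage improvement of `improved` over a non-zero `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# Hypothetical ranking for one session; the true next track is "t1".
ranked = ["t3", "t7", "t1", "t9"]
hit = recall_at_k(ranked, "t1", k=3)      # target is at rank 3, inside the cutoff
quality = ndcg_at_k(ranked, "t1", k=3)    # 1 / log2(3 + 1)

# Illustrative averages only: a baseline of 0.10 rising to 0.195.
gain = relative_gain_percent(0.10, 0.195)
print(hit, quality, round(gain, 1))
```

If the reported figures are relative gains, as the wording suggests, a 95 percent lift means the score nearly doubled relative to the baseline, not that recall rose by 95 percentage points.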
The authors have released the enriched LastFM-1K dataset and the extended E4SRec codebase as a large-scale multimodal benchmark for music recommendation. The results suggest that grounding recommendations in actual song content (what the track sounds like, what the lyrics say, and how users engage with it) can substantially outperform collaborative filtering alone, though cross-modal fusion remains an open research problem.

