IndicMedDialog brings medical AI to a billion Indic-language speakers
Researchers release a parallel medical dialogue dataset spanning English and nine Indic languages, paired with a fine-tuned model for symptom elicitation across multilingual patient consultations.

IndicMedDialog is a parallel multi-turn medical dialogue dataset covering English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. Researchers Shubham Kumar Nigam, Suparnojit Sarkar, and Piyush Patel built the dataset by extending MDDial with LLM-generated synthetic consultations, translating them with TranslateGemma, verifying each dialogue with native speakers, and refining the text through a script-aware post-processing pipeline that corrects phonetic, lexical, and character-spacing errors common in non-Latin scripts.
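
The paper's post-processing code is not reproduced here, but a minimal sketch illustrates the kind of script-aware cleanup such a pipeline performs. It assumes the character-spacing errors show up as stray spaces splitting a base letter from its combining vowel sign; the clean_indic_text helper and the Unicode range it checks are illustrative, not the authors' implementation.

```python
import re
import unicodedata

def clean_indic_text(text: str) -> str:
    """Illustrative cleanup for translated Indic text (hypothetical helper,
    not the authors' pipeline). Targets character-spacing errors and
    inconsistent Unicode composition."""
    # Canonical composition so base letters and their vowel signs / nuktas
    # are represented consistently before any pattern matching.
    text = unicodedata.normalize("NFC", text)

    # Remove a space wrongly inserted between a base character and a combining
    # mark (Unicode categories Mn = nonspacing mark, Mc = spacing combining mark).
    def join_split_mark(m: re.Match) -> str:
        mark = m.group(2)
        if unicodedata.category(mark) in ("Mn", "Mc"):
            return m.group(1) + mark
        return m.group(0)

    # The range U+0900-U+0D7F covers the Brahmic scripts in the dataset.
    text = re.sub(r"(\S) ([\u0900-\u0D7F])", join_split_mark, text)

    # Collapse whitespace runs left behind by translation or the fix above.
    return re.sub(r"\s+", " ", text).strip()

if __name__ == "__main__":
    # Hindi fragment with a stray space splitting the vowel sign ी from मर.
    print(clean_indic_text("मर ीज को बुखार   है"))  # -> मरीज को बुखार है
```

A real pipeline would also need language-specific rules for the phonetic and lexical error classes, which depend on the translation system's failure modes rather than on Unicode structure alone.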
Most medical AI systems operate in single-turn question–answering modes or rely on template-based datasets, limiting conversational realism and multilingual reach. The Indic language family accounts for over a billion native speakers across South Asia, yet medical dialogue datasets in these languages remain scarce. IndicMedDialog addresses that gap with parallel data: the same consultation appears in all ten languages, enabling direct cross-lingual comparison and transfer learning. The team also fine-tuned IndicMedLM, a parameter-efficient adaptation of a quantized small language model. It accepts optional patient pre-context to personalize multi-turn symptom elicitation, so follow-up questions build on prior information rather than starting from scratch each turn.
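
Neither the base model nor the adapter recipe behind IndicMedLM is spelled out in this summary, so the sketch below is only a plausible shape for that setup: a QLoRA-style parameter-efficient configuration with Hugging Face transformers and peft, plus a hypothetical prompt format that prepends the optional patient pre-context. The base-model name, the build_prompt helper, and the [PATIENT CONTEXT] tag are all assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE = "google/gemma-2-2b-it"  # placeholder; the actual base model is not named here

# Quantize the base model to 4 bits and train low-rank adapters on top, one
# common parameter-efficient recipe; the authors' exact setup may differ.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb)
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
               target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)

def build_prompt(turns: list[dict], pre_context: str | None = None) -> str:
    """Serialize one consultation into a training string (hypothetical format).
    The optional pre-context (age, history, prior complaints) is prepended so
    follow-up questions can condition on it instead of starting cold."""
    header = f"[PATIENT CONTEXT] {pre_context}\n" if pre_context else ""
    body = "\n".join(f"{t['role'].upper()}: {t['text']}" for t in turns)
    return header + body + "\nDOCTOR:"

example = build_prompt(
    [{"role": "patient", "text": "मुझे दो दिन से बुखार और खांसी है।"}],
    pre_context="34-year-old, no chronic conditions",
)
inputs = tokenizer(example, return_tensors="pt")
```

Keeping the base model quantized and training only the adapter weights keeps the trainable parameter count small, which matters when the same recipe has to be run or ablated across ten languages.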
Evaluations include comparisons against zero-shot multilingual baselines, systematic error analysis across all ten languages, and clinical plausibility checks through medical expert review. The preprint was posted to arXiv on May 14, 2026.