Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages
Quick Take
A two-stage retriever fine-tuned from a Spanish biomedical encoder outperforms BioBERT-ST on clinical coding retrieval in most of the evaluated languages, reaching an R@5 of 0.822 in aggregate and 0.829 for Portuguese. The study shows that synthetic training data from large generative language models can improve clinical search in non-English languages.
Key Points
- Bi-encoder matches BioBERT-ST on MRR (0.876 vs. 0.866) without English biomedical pretraining.
- Cross-encoder reranker improves R@5 to 0.822, with gains in four out of five languages.
- Portuguese retrieval reaches R@5 of 0.829, versus 0.714 for BioBERT-ST.
- Study provides an open recipe for building domain-specific medical retrievers.
- Learning gain quantified at +15.9% (MRR 0.755 to 0.876) with ~19,500 synthetic pairs.
Article Content
From source RSS / original summary: arXiv:2605.30529v1, Announce Type: new.
Abstract: Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in other languages, particularly retrieval of ICD-10-CM / CIE-10 codes, recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap.
We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining.
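The retrieve-then-rerank pattern described above can be sketched in a few lines. This is only an illustration of the control flow, not the paper's implementation: `embed` and `pair_score` are hypothetical toy stand-ins (bag-of-words cosine and token overlap) for the fine-tuned bi-encoder and cross-encoder, and the four ICD-10-CM entries and the parameters `first_stage_k` and `final_k` are assumptions chosen for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a bi-encoder: a bag-of-words count vector (assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pair_score(query, doc):
    """Toy stand-in for a cross-encoder: scores the (query, doc) pair jointly.
    Here it is simply the fraction of query tokens found in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def retrieve_then_rerank(query, codes, first_stage_k=3, final_k=2):
    # Stage 1: query and descriptions are embedded independently,
    # then ranked by cosine similarity to build a shortlist.
    q_vec = embed(query)
    shortlist = sorted(
        codes, key=lambda c: cosine(q_vec, embed(codes[c])), reverse=True
    )[:first_stage_k]
    # Stage 2: only the shortlist is rescored with the pairwise scorer.
    reranked = sorted(
        shortlist, key=lambda c: pair_score(query, codes[c]), reverse=True
    )
    return reranked[:final_k]

# Illustrative code descriptions (lower-cased, punctuation removed).
CODES = {
    "E11.9": "type 2 diabetes mellitus without complications",
    "E10.9": "type 1 diabetes mellitus without complications",
    "I10": "essential primary hypertension",
    "J45.909": "unspecified asthma uncomplicated",
}

top = retrieve_then_rerank("diabetes mellitus type 2", CODES)
print(top)
```

The design point is that the first stage scores each description independently of the query, so it can be cheap and precomputed, while the second stage is applied only to the shortlist because it has to look at every query-description pair.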
Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.
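For readers less familiar with the metrics quoted above, the following is a minimal sketch of how MRR, R@k and a relative gain are commonly computed. It assumes one gold code per query, so R@k reduces to a hit rate; the abstract does not state the paper's exact evaluation protocol. The function names and the toy rankings are hypothetical.

```python
def reciprocal_rank(ranked, gold):
    """1 / rank of the gold code in the ranked list, or 0.0 if it is absent."""
    for rank, code in enumerate(ranked, start=1):
        if code == gold:
            return 1.0 / rank
    return 0.0

def mrr(rankings, golds):
    """Mean reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, g) for r, g in zip(rankings, golds)) / len(golds)

def recall_at_k(rankings, golds, k):
    """Share of queries whose gold code appears in the top k results
    (assumes a single gold code per query)."""
    hits = sum(1 for r, g in zip(rankings, golds) if g in r[:k])
    return hits / len(golds)

def relative_gain(before, after):
    """Relative improvement of `after` over `before`."""
    return (after - before) / before

# Toy example: three queries, gold code "A" ranked 1st, 2nd and 3rd.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
golds = ["A", "A", "A"]
print(mrr(rankings, golds))             # (1 + 1/2 + 1/3) / 3
print(recall_at_k(rankings, golds, 2))  # 2 of 3 queries hit within the top 2

# Relative gain from the rounded MRR values quoted in the abstract.
print(relative_gain(0.755, 0.876))
```

Applied to the rounded endpoints 0.755 and 0.876, `relative_gain` gives roughly 16%, in line with the reported +15.9%; the small difference presumably comes from rounding of the published MRR values.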