Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

arXiv cs.CL·David Rey-Blanco, Roberto Cruz


6/1/2026

Quick Take

A two-stage retriever (bi-encoder plus cross-encoder reranker), fine-tuned from a Spanish biomedical encoder on LLM-generated synthetic data, outperforms BioBERT-ST on clinical code retrieval in four of five languages. It reaches R@5 of 0.822 overall and 0.829 for Portuguese, at the cost of a small English regression. The study suggests that large generative language models can serve as training-data generators for non-English clinical search.

Key Points

Bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) without English biomedical pretraining.
Cross-encoder reranker lifts aggregate R@5 to 0.822, with gains in four of five languages and a small English regression.
Portuguese retrieval reaches R@5 of 0.829, compared with 0.714 for BioBERT-ST.
Study provides an open recipe for building domain-specific medical retrievers from LLM-generated data.
Learning gain quantified as MRR rising from 0.755 to 0.876 (+15.9%) with ~19,500 synthetic pairs.

Article Content

From source RSS / original summary

arXiv:2605.30529v1 Announce Type: new Abstract: Sentence-embedding models for semantic search are overwhelmingly developed and evaluated on English corpora. When applied to clinical retrieval in other languages -- particularly retrieval of ICD-10-CM / CIE-10 codes -- recall degrades in ways often masked by aggregate benchmarks. We study whether large generative language models can serve as data factories to close this gap.

We build a two-stage retriever (bi-encoder followed by cross-encoder reranker), fine-tuned from a Spanish biomedical encoder (PlanTL-GOB-ES/bsc-bio-ehr-es) on Gemini-generated synthetic data covering English, Spanish, Catalan, Italian, Portuguese and French, and evaluate against BioBERT-ST and the un-tuned Spanish encoder. The bi-encoder alone matches BioBERT-ST on MRR (0.876 vs. 0.866) and overtakes it on R@3 (0.650 vs. 0.626) and R@5 (0.804 vs. 0.790) without English biomedical pretraining.
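As an illustration only, the retrieve-then-rerank structure described above can be sketched in plain Python. Everything in it is an assumption made for the example and not the authors' implementation: the catalog entries and their identifiers (`code_a`, `code_b`, `code_c`) are invented, `embed` is a bag-of-words stand-in for a bi-encoder, and `pair_score` is a shared-bigram stand-in for a cross-encoder.

```python
from collections import Counter
from math import sqrt

# Hypothetical catalog: made-up identifiers and descriptions, not real ICD-10 entries.
CATALOG = {
    "code_a": "hypertension secondary to renal disease",
    "code_b": "renal disease secondary to hypertension",
    "code_c": "asthma with acute exacerbation",
}

def tokens(text):
    return text.lower().split()

def embed(text):
    """Stand-in for a bi-encoder: query and codes are encoded independently."""
    return Counter(tokens(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)  # Counter yields 0 for missing tokens
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def first_stage(query, k):
    """Stage 1: rank the whole catalog by embedding similarity, keep the top k."""
    q = embed(query)
    ranked = sorted(CATALOG, key=lambda c: (-cosine(q, embed(CATALOG[c])), c))
    return ranked[:k]

def pair_score(query, description):
    """Stand-in for a cross-encoder: scores the (query, candidate) pair jointly.
    Counts shared word bigrams, so it is sensitive to word order."""
    def bigrams(text):
        t = tokens(text)
        return set(zip(t, t[1:]))
    return len(bigrams(query) & bigrams(description))

def search(query, k_retrieve=2, k_final=2):
    """Stage 2: rerank only the shortlisted candidates with the pairwise scorer."""
    candidates = first_stage(query, k_retrieve)
    reranked = sorted(candidates, key=lambda c: (-pair_score(query, CATALOG[c]), c))
    return reranked[:k_final]

QUERY = "renal disease secondary to hypertension"
print(first_stage(QUERY, 2))  # shortlist from the order-insensitive first stage
print(search(QUERY))          # shortlist after order-aware reranking
```

In this toy setup the first stage cannot tell `code_a` from `code_b`, because both contain the same words, while the pairwise scorer can. That is the general motivation for placing a more expensive joint scorer behind a cheap retriever, though the real models in the paper are neural encoders rather than these stand-ins.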

Adding a cross-encoder reranker lifts aggregate R@5 to 0.822 and dominates on four of five languages (+0.017 Spanish, +0.033 Catalan, +0.018 French, +0.037 Portuguese) at the cost of a small English regression. The trade-off is clinically acceptable: Portuguese reaches R@5 = 0.829 vs. BioBERT-ST's 0.714. Contributions: an open recipe for building domain-specific medical retrievers from LLM-generated data; quantification of the learning gain (MRR 0.755 to 0.876, +15.9% with ~19,500 synthetic pairs); and a characterisation of where gains concentrate by language and rank.
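For readers unfamiliar with the metrics, the sketch below shows one common way to compute MRR and R@k, together with the relative-gain arithmetic. It assumes a single correct code per query, which the abstract does not state, and `TOY_RUNS` is invented data used only to exercise the functions.

```python
def reciprocal_rank(ranked, gold):
    """1 / rank of the gold code, or 0.0 if it is not in the ranked list."""
    for rank, code in enumerate(ranked, start=1):
        if code == gold:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    """Mean reciprocal rank over (ranked_list, gold_code) pairs."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)

def recall_at_k(runs, k):
    """Fraction of queries whose gold code appears in the top k
    (assumes exactly one gold code per query)."""
    return sum(1 for ranked, gold in runs if gold in ranked[:k]) / len(runs)

def relative_gain(before, after):
    """Relative improvement of `after` over `before`."""
    return (after - before) / before

# Invented toy results: gold code "a" at rank 1, 2, 3, and missing.
TOY_RUNS = [
    (["a", "b", "c"], "a"),
    (["b", "a", "c"], "a"),
    (["c", "b", "a"], "a"),
    (["b", "c", "d"], "a"),
]

print(mrr(TOY_RUNS))             # (1 + 1/2 + 1/3 + 0) / 4
print(recall_at_k(TOY_RUNS, 3))  # 3 of 4 queries hit within the top 3
print(relative_gain(0.755, 0.876))
```

Applied to the rounded MRR values quoted in the abstract, `relative_gain(0.755, 0.876)` comes out at roughly 16.0%. The small difference from the reported +15.9% is within what rounding of the three-decimal MRR values could explain.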

