SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia
Quick Take
SEA-Embedding is a fully open and reproducible text-embedding pipeline for Southeast Asian languages that achieves state-of-the-art results on SEA-BED. The authors use it to study three design factors behind robust embeddings (data composition, training objective, and base encoder initialization), in response to recent state-of-the-art models whose closed or undisclosed training data makes them irreproducible.
Key Points
- SEA-Embedding is trained exclusively on publicly available data.
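- The pipeline enables systematic, reproducible analysis of what makes text embeddings robust for the region.
- It addresses the reliance of most recent state-of-the-art embedding models on closed or undisclosed training data, which makes them irreproducible.
- It achieves state-of-the-art results on the SEA-BED benchmark.
- It studies three core design factors: data composition, training objective, and base encoder initialization.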
Article Excerpt
arXiv:2606.03027v1. Abstract: Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages.
We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.