SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

arXiv cs.CL·Peerat Limkonchotiwat, Raymond Ng, Sarana Nutanong, Jian Gang Ngui

6/3/2026

Quick Take

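SEA-Embedding introduces a fully open and reproducible text-embedding pipeline for Southeast Asian languages, trained only on publicly available data, and achieves state-of-the-art results on SEA-BED. The authors use it to study three design factors behind robust embeddings: data composition, training objective, and base encoder initialization. The work targets the reproducibility gap left by recent models that rely on closed or undisclosed training data.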

Key Points

SEA-Embedding is trained exclusively on publicly available data.
The pipeline allows systematic analysis of robust text embeddings.
It addresses the lack of reproducibility in recent state-of-the-art embedding models, most of which rely on closed or undisclosed training data.
Achieves state-of-the-art performance on the SEA-BED benchmark.
Studies three design factors for robust embeddings: data composition, training objective, and base encoder initialization.

Article Excerpt

From source RSS / original summary

arXiv:2606.03027v1. Abstract: Text embeddings are fundamental to many downstream applications, making robustness important for real-world NLP. However, most recent state-of-the-art embedding models are not reproducible because they rely on closed or undisclosed training data, and they remain insufficiently robust for Southeast Asian languages.

We present SEA-Embedding, a fully open and reproducible text-embedding pipeline for Southeast Asian languages trained only on publicly available data, and use it to study three core factors of robust embedding design: data composition, training objective, and base encoder initialization. SEA-Embedding achieves state-of-the-art results on SEA-BED while enabling systematic and reproducible analysis of robust text embeddings for the region.
