5ting at SemEval-2026 Task 8: Strong End-to-End Multi-Turn RAG via LLM-Based Reranking and Faithfulness Control
Quick Answer
The 5ting system for SemEval-2026 Task 8 integrates BGE-M3 dense retrieval and LLM-based reranking to improve multi-turn Retrieval-Augmented Generation (RAG).
Quick Take
5ting pairs BGE-M3 dense retrieval with LLM-based reranking for multi-turn Retrieval-Augmented Generation (RAG), and constrains generation to the retrieved evidence. Its retriever scored nDCG@5 = 0.4719 in Task A, and the end-to-end system reached a harmonic score of 0.5597 in Task C.
Key Points
- 5ting combines BGE-M3 dense retrieval with FAISS indexing, dual-query merged retrieval, and LLM-based reranking.
- The system targets three multi-turn RAG problems: context drift, underspecification, and hallucination risk.
- The retriever achieved nDCG@5 = 0.4719 in Task A of MTRAGEval.
- The end-to-end system was ranked in Task C with a harmonic score of 0.5597.
- Its RL_F score of 0.7692 in Task C reflects the system's faithfulness control, with generation constrained to retrieved evidence.
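To make the retrieval metric above concrete, here is a minimal sketch of how nDCG@5 can be computed for a single query. It uses the common linear-gain, log2-discount form; whether the task's official scorer uses exactly this variant is an assumption, and the passage IDs and relevance labels are made up for illustration.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades."""
    # rank is 0-based, so the discount for the top result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_ids, relevance_by_id, k=5):
    """nDCG@k for one query: DCG of the system ranking over DCG of the ideal ranking."""
    gains = [relevance_by_id.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance_by_id.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant passages are labeled for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Hypothetical example: three relevant passages, two of them retrieved in the top 5.
ranked = ["p7", "p2", "p9", "p4", "p1"]
qrels = {"p2": 1, "p4": 1, "p8": 1}
score = ndcg_at_k(ranked, qrels, k=5)
print(f"nDCG@5 = {score:.4f}")
```

A benchmark-level figure such as 0.4719 would typically be the average of this per-query value over all evaluated queries.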
Abstract: We introduce 5ting, our system for SemEval-2026 Task 8 (MTRAGEval), which evaluates multi-turn Retrieval-Augmented Generation (RAG) systems. Multi-turn RAG involves context drift, underspecification, and hallucination risk. Our system combines BGE-M3 dense retrieval with FAISS indexing, dual-query merged retrieval, and LLM-based reranking, followed by role-separated generation constrained to retrieved evidence. The retriever achieved nDCG@5 = 0.4719 in Task A, while the end-to-end system ranked in Task C with a harmonic score of 0.5597 and RL_F = 0.7692.
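The abstract does not spell out how dual-query merged retrieval works, so the following is only a sketch of one plausible reading: run the search once per query variant, then merge the candidate lists by keeping each passage's best score. The choice of two queries (the last user turn and a context-aware rewrite), the max-score merge rule, and all names are assumptions. Hand-written toy vectors stand in for BGE-M3 embeddings, and a brute-force cosine scan stands in for a FAISS index so the example runs with the standard library alone.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vec, index, top_k):
    """Brute-force nearest-neighbour search (stand-in for a vector index lookup)."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def merged_retrieval(query_vecs, index, top_k=3):
    """Search once per query, then merge by keeping each passage's best score."""
    best = {}
    for query_vec in query_vecs:
        for doc_id, score in search(query_vec, index, top_k):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)


# Toy passage embeddings (hypothetical).
index = {
    "d1": [1.0, 0.0, 0.0],
    "d2": [0.0, 1.0, 0.0],
    "d3": [0.7, 0.7, 0.0],
    "d4": [0.0, 0.0, 1.0],
}
last_turn_vec = [1.0, 0.1, 0.0]  # embedding of the latest user turn (assumed)
rewritten_vec = [0.1, 1.0, 0.0]  # embedding of a context-aware rewrite (assumed)

candidates = merged_retrieval([last_turn_vec, rewritten_vec], index, top_k=2)
for doc_id, score in candidates:
    print(doc_id, round(score, 3))
```

In a pipeline like the one described, the merged candidate list would then be passed to the LLM-based reranker, which selects the passages the generator is allowed to use.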
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2606.28737 [cs.CL] (or arXiv:2606.28737v1 [cs.CL] for this version)
DOI: https://doi.org/10.48550/arXiv.2606.28737 (arXiv-issued DOI via DataCite)
Submission history
From: Thien-Qua T. Nguyen
[v1] Sat, 27 Jun 2026 05:13:49 UTC (326 KB)
Originally published at arxiv.org