5ting at SemEval-2026 Task 8: Strong End-to-End Multi-Turn RAG via LLM-Based Reranking and Faithfulness Control

arXiv cs.CL · Thien-Qua T. Nguyen, Chi Hoang, Nguyen Tran, Tri Le, Khanh Truong, Chinh Trong Nguyen


6/30/2026

Quick Answer

The 5ting system for SemEval-2026 Task 8 integrates BGE-M3 dense retrieval and LLM-based reranking to improve multi-turn Retrieval-Augmented Generation (RAG).

Quick Take

The 5ting system for SemEval-2026 Task 8 integrates BGE-M3 dense retrieval and LLM-based reranking to improve multi-turn Retrieval-Augmented Generation (RAG). Its retriever achieved nDCG@5 = 0.4719 in Task A, and the end-to-end system reached a harmonic score of 0.5597 in Task C, with generation constrained to retrieved evidence.

Key Points

5ting combines BGE-M3 dense retrieval with FAISS indexing, dual-query merged retrieval, and LLM-based reranking.
The system targets three multi-turn RAG problems: context drift, underspecification, and hallucination risk.
The retriever achieved nDCG@5 = 0.4719 in Task A of MTRAGEval.
The end-to-end system reached a harmonic score of 0.5597 in Task C.
The Task C RL_F score of 0.7692 indicates strong faithfulness control.
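
For readers unfamiliar with the retrieval metric quoted above, the following is a minimal sketch of how nDCG@5 can be computed for a single query. It assumes linear gain and a log2 position discount; the official MTRAGEval scorer may use a different gain function or relevance scale, and the names `dcg` and `ndcg_at_k` are illustrative rather than taken from the paper.

```python
import math


def dcg(rels):
    """Discounted cumulative gain with linear gain and log2 discount."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))


def ndcg_at_k(ranked_rels, judged_rels, k=5):
    """nDCG@k for one query.

    ranked_rels: relevance grades of the retrieved passages, in ranked order.
    judged_rels: relevance grades of all passages judged relevant for the query
                 (used to build the ideal ranking).
    """
    ideal = sorted(judged_rels, reverse=True)[:k]
    idcg = dcg(ideal)
    if idcg == 0:
        return 0.0  # no relevant passage exists, so the score is defined as 0 here
    return dcg(ranked_rels[:k]) / idcg


if __name__ == "__main__":
    # Toy example: one relevant passage exists and it was retrieved at rank 2.
    print(round(ndcg_at_k([0, 1, 0, 0, 0], [1]), 4))
```

A system-level figure such as the reported 0.4719 would be the average of this per-query value over the evaluation set, under the same assumptions.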

[Submitted on 27 Jun 2026]

Abstract: We introduce 5ting, our system for SemEval-2026 Task 8 (MTRAGEval), which evaluates multi-turn Retrieval-Augmented Generation (RAG) systems. Multi-turn RAG involves context drift, underspecification, and hallucination risk. Our system combines BGE-M3 dense retrieval with FAISS indexing, dual-query merged retrieval, and LLM-based reranking, followed by role-separated generation constrained to retrieved evidence. The retriever achieved nDCG@5 = 0.4719 in Task A, while the end-to-end system ranked in Task C with a harmonic score of 0.5597 and RL_F = 0.7692.
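
The abstract does not say how the two queries in "dual-query merged retrieval" are formed or how their results are combined. As a hedged illustration only, the sketch below assumes each query returns a list of `(doc_id, score)` hits from the same retriever and merges them by keeping the best score per document before passing the top candidates on to a reranker. The function name `merge_ranked` and the max-score rule are assumptions, not the authors' method.

```python
def merge_ranked(hits_a, hits_b, top_k=5):
    """Merge two ranked hit lists, deduplicating by document id.

    hits_a, hits_b: iterables of (doc_id, score) pairs, higher score = better.
    Assumption: scores from both queries are comparable (same retriever),
    so a document keeps the higher of its two scores.
    """
    best = {}
    for doc_id, score in list(hits_a) + list(hits_b):
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    # Sort by descending score; break ties by doc_id for deterministic output.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    # Hypothetical hits for two formulations of the same user turn.
    last_turn_hits = [("d1", 0.9), ("d2", 0.5)]
    rewritten_hits = [("d2", 0.7), ("d3", 0.6)]
    print(merge_ranked(last_turn_hits, rewritten_hits, top_k=3))
```

Other merge rules, such as reciprocal rank fusion, would fit the same interface; which one 5ting uses is not stated in this summary.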

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2606.28737 [cs.CL]
	(or arXiv:2606.28737v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2606.28737 arXiv-issued DOI via DataCite

Submission history

From: Thien-Qua T. Nguyen
[v1] Sat, 27 Jun 2026 05:13:49 UTC (326 KB)

Originally published at arxiv.org
