EverydayGPT: Confidence-Gated Routing for Efficient and Safe Hybrid GPT-RAG Conversational QA
Quick Answer
EverydayGPT introduces a Confidence-Gated Routing mechanism to optimize conversational QA, reducing latency by over 120x for 85% of queries.
Quick Take
EverydayGPT's backbone is a 205M-parameter GPT model trained from scratch on 10B tokens of FineWeb-Edu. On a 500-question in-domain benchmark the system reaches an F1 score of 0.226, ahead of GPT-only (0.171) and unconditional RAG (0.210), and its routing cuts mean latency by 6.3x.
Key Points
- EverydayGPT uses Confidence-Gated Routing to enhance efficiency in QA systems.
- 85% of queries are resolved via fast RAG extraction (~45 ms) instead of the GPT pathway (~5.9 s).
- Achieves an F1 score of 0.226 on a 500-question in-domain benchmark, versus 0.171 for GPT-only and 0.210 for unconditional RAG.
- Accuracy gains over strong baselines are modest but consistent; the efficiency gain is larger, with a 6.3x reduction in mean latency.
- Study focuses on routing strategies under resource constraints, not state-of-the-art claims.
Article Content
From source RSS / original summary
arXiv:2606.11212v1 Announce Type: new
Abstract: Standard retrieval-augmented generation (RAG) pipelines route every query through retrieval and generation unconditionally, incurring unnecessary computation and propagating low-quality context to the generator. We introduce EverydayGPT, a lightweight conversational QA system built around a Confidence-Gated Routing (CGR) mechanism that formalises the routing decision as a joint policy over retrieval distance and extraction adequacy.
The backbone is a 205M-parameter GPT trained from scratch on 10B tokens of FineWeb-Edu. CGR avoids invoking the costly GPT pathway (~5.9 s) for 85 percent of queries by resolving them via fast RAG extraction (~45 ms), yielding over 120x latency reduction on the majority of queries while maintaining answer quality. On a 500-question in-domain benchmark, the system achieves F1 = 0.226 +/- 0.004 compared to 0.171 for GPT-only and 0.210 for unconditional RAG.
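To make the gating idea concrete, here is a minimal sketch of a confidence gate over the two signals the abstract names, retrieval distance and extraction adequacy. The abstract does not give the actual policy, so everything here is an assumption for illustration: the conjunctive threshold rule, the names `GateConfig` and `route`, and both threshold values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateConfig:
    # Hypothetical thresholds, chosen only for illustration.
    max_distance: float = 0.35  # retrieval distance must be at most this
    min_adequacy: float = 0.60  # extraction adequacy must be at least this


def route(distance: float, adequacy: float, cfg: GateConfig = GateConfig()) -> str:
    """Pick a pathway from two confidence signals.

    Returns "extract" (fast RAG extraction) only when the retrieved passage
    is close enough AND the extracted span looks adequate; otherwise falls
    back to "generate" (the slower GPT pathway).
    """
    if distance <= cfg.max_distance and adequacy >= cfg.min_adequacy:
        return "extract"
    return "generate"


if __name__ == "__main__":
    print(route(distance=0.10, adequacy=0.90))
    print(route(distance=0.50, adequacy=0.90))
    print(route(distance=0.10, adequacy=0.20))
```

Requiring both conditions is one simple way to express a joint policy: a close retrieval hit alone is not enough if the extracted answer looks inadequate, and vice versa. A learned or weighted combination of the two signals would be an equally plausible reading of the abstract.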
Gains over strong baselines are modest but consistent, while efficiency improvements are substantial (6.3x mean latency reduction). A structured grounding audit finds no unsupported claims in the sampled set, with explicit scope limitations. We position this work as a study of routing strategies under resource constraints rather than a claim of state-of-the-art performance.
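The two latency figures in the abstract (over 120x on most queries, 6.3x on average) can be related with a back-of-the-envelope calculation. The sketch below assumes a simple two-pathway cost model: 85% of queries cost ~45 ms, the remaining 15% cost ~5.9 s, and the baseline sends every query down the ~5.9 s pathway. That cost model and the function names are assumptions for illustration, not the paper's measurement method.

```python
FAST_S = 0.045     # fast RAG extraction latency in seconds (~45 ms, from the abstract)
SLOW_S = 5.9       # GPT pathway latency in seconds (~5.9 s, from the abstract)
FAST_SHARE = 0.85  # fraction of queries resolved on the fast pathway


def per_query_speedup(fast_s: float = FAST_S, slow_s: float = SLOW_S) -> float:
    """Speedup for a single query that takes the fast pathway."""
    return slow_s / fast_s


def mean_gated_latency(
    fast_s: float = FAST_S, slow_s: float = SLOW_S, fast_share: float = FAST_SHARE
) -> float:
    """Expected latency when queries are split between the two pathways."""
    return fast_share * fast_s + (1.0 - fast_share) * slow_s


def mean_speedup() -> float:
    """Mean speedup against a baseline that always uses the slow pathway."""
    return SLOW_S / mean_gated_latency()


if __name__ == "__main__":
    print(f"per-query speedup on the fast path: {per_query_speedup():.0f}x")
    print(f"mean gated latency: {mean_gated_latency():.3f} s")
    print(f"mean speedup: {mean_speedup():.2f}x")
```

Under these assumptions the fast path is about 131x quicker per query and the mean speedup comes out near 6.4x, in line with the "over 120x" and 6.3x figures quoted above. The small gap may come from rounding in the approximate latencies or from overheads this sketch ignores. The mean is dominated by the 15% of queries that still reach the GPT pathway, which is why it is so much smaller than the per-query figure.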