DLLG: Dynamic Logit-Level Gating of LLM Experts
Quick Take
DLLG (Dynamic Logit-Level Gating) combines multiple specialized LLMs by learning, at each generation step, how to weight the experts' logits, using only response-level supervision instead of token-level labels. It outperforms routing, heuristic ensembling, and parameter merging across a range of reasoning and code benchmarks, which suggests that learned logit-level fusion is a robust way to integrate specialized models.
Key Points
- DLLG learns token-level expert fusion from sparse response-level supervision.
- A lightweight gating module predicts step-wise fusion weights; the experts themselves are not retrained.
- DLLG consistently outperforms routing, heuristic ensembling, and parameter-merging baselines.
- The gains hold across model scales, supporting learned logit-level fusion as a robust, scalable paradigm.
- Demonstrated effectiveness on diverse reasoning and code benchmarks.
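The brief does not give DLLG's exact fusion rule. As a minimal sketch, assuming all experts share one vocabulary and that "logit-level fusion" means a convex combination of the experts' next-token logits, a single decoding step could look like this. The `gate_scores` argument stands in for the gating module's output at that step; `softmax`, `fuse_logits`, and `greedy_step` are hypothetical names, not the paper's API.

```python
import math
from typing import List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(expert_logits: Sequence[Sequence[float]], gate_scores: Sequence[float]) -> List[float]:
    """Fuse next-token logits from K experts with step-wise gate weights (assumed rule).

    expert_logits[k][v] is expert k's logit for token v (a shared vocabulary is assumed);
    gate_scores[k] is the gate's unnormalized score for expert k at this step.
    """
    if len(expert_logits) != len(gate_scores):
        raise ValueError("need one gate score per expert")
    vocab = len(expert_logits[0])
    if any(len(row) != vocab for row in expert_logits):
        raise ValueError("experts must share one vocabulary")
    weights = softmax(gate_scores)  # non-negative, sum to 1
    return [sum(w * row[v] for w, row in zip(weights, expert_logits)) for v in range(vocab)]


def greedy_step(expert_logits: Sequence[Sequence[float]], gate_scores: Sequence[float]) -> int:
    """Return the argmax token of the fused logits (greedy decoding)."""
    fused = fuse_logits(expert_logits, gate_scores)
    return max(range(len(fused)), key=fused.__getitem__)


# Two experts that disagree; the gate favors expert 1 three-to-one at this step.
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
print(fuse_logits(logits, [0.0, math.log(3.0)]))  # approximately [0.5, 0.0, 1.5]
print(greedy_step(logits, [0.0, math.log(3.0)]))  # 2
```

Because the gate is re-evaluated at every step, the weights (and so the dominant expert) can shift mid-response, which is the adaptability a router gives up when it commits to one expert before generation starts.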
Article Excerpt
From source RSS / original summary
arXiv:2606.04378v1 Announce Type: new
Abstract: Leveraging multiple specialized LLMs can combine complementary strengths, but existing approaches trade adaptability for stability: routing commits prematurely, heuristic ensembling depends on fragile proxies, and parameter merging introduces interference. We propose DLLG (Dynamic Logit-Level Gating), a dynamic logit-level ensembling framework that learns token-level expert fusion from sparse response-level supervision.
A lightweight gating module predicts step-wise fusion weights, linking trajectory-level correctness to generation without token-level labels or expert retraining. Across diverse reasoning and code benchmarks, DLLG consistently outperforms strong routing, heuristic ensembling, and parameter-merging baselines across model scales, highlighting learned logit-level fusion as a robust and scalable paradigm for integrating specialized experts.
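The abstract says the gate links trajectory-level correctness to generation with frozen experts and no token-level labels, but it does not state the training objective. One plausible instantiation, shown purely as an assumption, is a REINFORCE-style policy-gradient update: a hypothetical linear gate maps per-step features `h` to expert scores, and every token of a response receives the same credit (reward minus a baseline). Gradients reach only the gate matrix `W`; expert logits are fixed inputs. `gate_scores`, `step_logprob_and_grad`, and `update` are illustrative names.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
# One decoding step of a sampled response: (features h_t, per-expert logits, emitted token id).
Step = Tuple[List[float], Matrix, int]


def softmax(xs: Sequence[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gate_scores(W: Matrix, h: Sequence[float]) -> List[float]:
    # Hypothetical linear gate: one unnormalized score per expert.
    return [sum(wi * hi for wi, hi in zip(row, h)) for row in W]


def step_logprob_and_grad(W: Matrix, h: Sequence[float], expert_logits: Matrix,
                          token: int) -> Tuple[float, Matrix]:
    """log p(token) under the fused distribution and its gradient w.r.t. W."""
    w = softmax(gate_scores(W, h))  # step-wise fusion weights
    K, V = len(expert_logits), len(expert_logits[0])
    z = [sum(w[k] * expert_logits[k][v] for k in range(K)) for v in range(V)]  # fused logits
    p = softmax(z)
    logp = math.log(p[token])
    # d logp / d w_k = L[k][token] - E_p[L[k]]
    g = [expert_logits[k][token] - sum(p[v] * expert_logits[k][v] for v in range(V))
         for k in range(K)]
    mean_g = sum(w[k] * g[k] for k in range(K))
    ds = [w[j] * (g[j] - mean_g) for j in range(K)]  # back through the gate softmax
    grad = [[ds[j] * hi for hi in h] for j in range(K)]  # back through the linear gate
    return logp, grad


def update(W: Matrix, response: Sequence[Step], reward: float, baseline: float,
           lr: float = 0.1) -> Matrix:
    """REINFORCE-style step from one response-level reward; experts are never touched."""
    adv = reward - baseline  # same credit for every token: no token-level labels
    total = [[0.0] * len(row) for row in W]
    for h, logits, tok in response:
        _, grad = step_logprob_and_grad(W, h, logits, tok)
        for j, row in enumerate(grad):
            for i, gji in enumerate(row):
                total[j][i] += gji
    # Gradient ascent on adv * sum_t log p(y_t); returns a new matrix.
    return [[W[j][i] + lr * adv * total[j][i] for i in range(len(W[j]))]
            for j in range(len(W))]
```

The gradient follows from the chain rule through both softmaxes: for expert k, d log p(y)/d w_k = L[k][y] - E_p[L[k]], which is then pushed back through the gate softmax and the linear gate. A correct response (reward above the baseline) shifts weight toward experts whose logits favored the emitted tokens; an incorrect one shifts weight away.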
More from arXiv cs.CL
Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
The REFLECT benchmark reveals that current LLM judges are unreliable, achieving below 55% accuracy in evaluating reasoning and evidence use, highlighting the need for improved evaluation methods for deep research agents.