NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track
Quick Answer
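NightFeats, a structured multi-agent RAG system, won Best Dynamic Evaluation in the text-to-text track of the MMU-RAGent competition at NeurIPS 2025, outperforming the proprietary baselines Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations.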
Quick Take
Rather than maximizing benchmark scores, NightFeats splits knowledge synthesis into retrieval, curation, and composition phases connected by explicit intermediate representations and handoff contracts. The authors read its competition results as evidence that transparent, evidence-grounded architectures align better with human preferences than systems tuned narrowly for automatic similarity metrics.
Key Points
- NightFeats decomposes knowledge synthesis into retrieval, curation, and composition phases.
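- Introduces three core primitives: temporal-semantic reranking, bounded contradiction reconciliation, and citation-preserving composition.
- Outperformed the proprietary baselines Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations.
- The authors conclude that architectural transparency and verifiable evidence grounding align better with human preferences than optimizing narrowly for automatic similarity metrics.
- Awarded Best Dynamic Evaluation in the text-to-text track of the MMU-RAGent competition at NeurIPS 2025.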
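The three-phase decomposition can be pictured as functions that pass typed records to one another. The sketch below is illustrative only: the summary does not describe the real NightFeats contracts, so every class, field, and function name here (`Evidence`, `RetrievalResult`, `CuratedContext`, `Answer`, `retrieve`, `curate`, `compose`) is a hypothetical stand-in, and keyword overlap stands in for a real retriever.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical handoff contracts; not taken from the NightFeats paper.

@dataclass(frozen=True)
class Evidence:
    doc_id: str
    text: str

@dataclass(frozen=True)
class RetrievalResult:  # handoff: retrieval -> curation
    query: str
    candidates: List[Evidence]

@dataclass(frozen=True)
class CuratedContext:  # handoff: curation -> composition
    query: str
    evidence: List[Evidence]

@dataclass(frozen=True)
class Answer:
    text: str
    citations: List[str]

def retrieve(query: str, corpus: List[Evidence]) -> RetrievalResult:
    """Toy retrieval: keep documents sharing at least one query term."""
    terms = set(query.lower().split())
    hits = [e for e in corpus if terms & set(e.text.lower().split())]
    return RetrievalResult(query, hits)

def curate(result: RetrievalResult, max_items: int = 2) -> CuratedContext:
    """Toy curation: drop exact-duplicate passages, then cap the context size."""
    seen = set()
    kept = []
    for e in result.candidates:
        if e.text not in seen:
            seen.add(e.text)
            kept.append(e)
    return CuratedContext(result.query, kept[:max_items])

def compose(ctx: CuratedContext) -> Answer:
    """Toy composition: every emitted sentence carries its source id."""
    parts = [f"{e.text} [{e.doc_id}]" for e in ctx.evidence]
    return Answer(" ".join(parts), [e.doc_id for e in ctx.evidence])

corpus = [
    Evidence("d1", "RAG systems retrieve documents."),
    Evidence("d2", "RAG systems retrieve documents."),
    Evidence("d3", "Cats sleep a lot."),
]
answer = compose(curate(retrieve("retrieve documents", corpus)))
print(answer.text)
```

Because each phase accepts and returns a fixed record type, a phase can be inspected or swapped without touching the others, which is one plausible reading of what "handoff contracts" buys in practice.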
Article Excerpt
Original abstract (arXiv:2606.11199v1): We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track.
Rather than targeting benchmark maximization, this work proposes a principled pipeline that decomposes knowledge synthesis into three coordinated phases: retrieval, curation, and composition, each governed by explicit intermediate representations and handoff contracts. Inspired by Agentic Context Engineering (ACE), the system introduces temporal-semantic reranking, bounded contradiction reconciliation, and citation-preserving composition as core architectural primitives.
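The abstract names temporal-semantic reranking without giving its formula. One common way to combine the two signals, shown here purely as an assumption, is a weighted sum of cosine similarity and an exponential recency decay. The names `Candidate`, `rerank`, `alpha`, and `half_life_days`, and the scoring rule itself, are hypothetical and may differ from what NightFeats actually does.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Candidate:
    doc_id: str
    embedding: Sequence[float]
    age_days: float  # time since publication; assumed to be known

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recency(age_days: float, half_life_days: float = 180.0) -> float:
    """Exponential decay: the score halves every half_life_days."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)

def rerank(query_emb: Sequence[float], candidates: List[Candidate],
           alpha: float = 0.7, half_life_days: float = 180.0) -> List[Candidate]:
    """Assumed scoring rule: alpha * semantic + (1 - alpha) * temporal."""
    def score(c: Candidate) -> float:
        semantic = cosine(query_emb, c.embedding)
        temporal = recency(c.age_days, half_life_days)
        return alpha * semantic + (1.0 - alpha) * temporal
    return sorted(candidates, key=score, reverse=True)

query = [1.0, 0.0]
docs = [
    Candidate("old", [1.0, 0.0], age_days=3650.0),   # exact match, ten years old
    Candidate("fresh", [0.9, 0.1], age_days=0.0),    # near match, published today
    Candidate("off", [0.0, 1.0], age_days=0.0),      # unrelated, published today
]
ranked = [c.doc_id for c in rerank(query, docs)]
print(ranked)
```

In this sketch, setting `alpha=1.0` reduces the ranking to pure semantic similarity, so the weight makes the trade-off between freshness and relevance explicit and tunable.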
Competition results show that NightFeats surpasses proprietary baselines including Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations, confirming that architectural transparency and verifiable evidence grounding are better aligned with human preferences than systems optimizing narrowly for automatic similarity metrics.