Featured

Curated AI papers across LLMs, agents, robotics, safety and applied AI, ranked by signal score and practical relevance.

arXiv cs.AI·Ye Liu, Srijan Bansal, Bo Pang, Yang Li, Zeyu Leo Liu, Yifei Ming, Zixuan Ke, Shafiq Joty, Semih Yavuz

3d ago

Featured·Original

Procedural Memory Distillation: Online Reflection for Self-Improving Language Models

AI Summary

Procedural Memory Distillation (PMD) enhances reinforcement learning by converting cross-episode signals into reusable memory, improving Qwen3-8B and OLMo3-Instruct-7B by 3.8-5.5% on SCIKNOWEVAL and by 7.9-13.6% on […]. Policy and memory co-evolve, which allows more effective self-supervision, with significant performance gains when both components are active.

Why Featured

PMD improves Qwen3-8B and OLMo3-Instruct-7B by turning cross-episode signals into reusable memory, so builders can apply the technique to make self-improving models more effective. For PMs and investors, it signals a potential competitive edge for products built on these models.

#LLM #AI Coding #Inference #Policy

arXiv cs.CL·Yining She, Yiliang Liang, Eunsuk Kang

3d ago

Featured·Original

Safeguarding LLM Agents from Misalignment through Provenance Analysis

AI Summary

ProvenanceGuard, a new framework for LLM agents, reduces misalignment error rates from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench, enhancing alignment with user intent through structured provenance analysis.

Why Featured

ProvenanceGuard sharply reduces misalignment error rates in LLM agents (from 42.9% to 1.8% on Agent-SafetyBench and from 32.1% to 17.3% on WorkBench), bringing agent behavior closer to user intent. For builders and PMs, this means more reliable agents; for investors, it signals improved safety and usability in AI applications, which could increase market adoption.

#LLM #Agent #Security

Google Research

6d ago

Featured·Original

Introducing TabFM: A zero-shot foundation model for tabular data

AI Summary

Google Research introduces TabFM, a zero-shot foundation model for tabular data, eliminating manual training and hyperparameter tuning. TabFM leverages in-context learning to generate predictions on unseen tables efficiently, outperforming traditional models in benchmarks across 38 classification and 13 regression datasets.

Why Featured

Google Research's introduction of TabFM, a zero-shot foundation model for tabular data, significantly reduces the need for manual training and hyperparameter tuning, enabling builders and PMs to deploy predictive models faster and at lower costs. This advancement could attract investor interest due to its potential to streamline data-driven decision-making across various industries.

#LLM #Open Source #AI Startup

Google Research

1w ago

Featured·Original

Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction

AI Summary

Google Research has accelerated the Gemini Nano models on Pixel devices by implementing frozen Multi-Token Prediction, significantly enhancing performance. This advancement allows for faster processing and improved efficiency in AI tasks, benefiting developers and users of Pixel devices. The new approach aims to reduce computational costs while maintaining high accuracy in predictions.

Why Featured

Google Research's acceleration of Gemini Nano models on Pixel devices through frozen Multi-Token Prediction enhances processing speed and efficiency, which is crucial for builders and PMs focusing on mobile AI applications. This development signals a reduction in computational costs while maintaining accuracy, making it a compelling opportunity for investors in the AI and mobile tech sectors.

#LLM #AI Coding #Inference #AI Assistant

arXiv cs.CL·Amirreza Esmaeili, Fatemeh Fard

3d ago

Featured·Original

TokenScope: Token-Level Explainability and Interpretability for Code-Oriented Tasks in Large Language Models

AI Summary

TokenScope is an interactive tool designed for decoder-based large language models (LLMs) that enhances token-level explainability during code generation. It integrates decoding-time signals with structural program analysis, allowing for interactive token replacement and exploration of alternative generation paths, thereby improving understanding of LLM behavior.

Why Featured

TokenScope enhances token-level explainability for decoder-based large language models during code generation, allowing builders and PMs to better understand model behavior and improve output quality. This tool's interactive features can lead to more efficient debugging and optimization processes, making it a valuable asset for developers and investors focused on AI-driven coding solutions.

#LLM #AI Coding #Open Source

arXiv cs.CL·Zhiyun Zhang, Liwen Sun, Xiang Qian, Chenyan Xiong

3d ago

Featured·Original

FaithMed: Training LLMs For Faithful Evidence-Based Medical Reasoning

AI Summary

FaithMed enhances medical reasoning by integrating clinician-designed rubrics with reinforcement learning, achieving a 9% improvement over agentic-search baselines and a 15.5% increase in evidence-based rubric scores across seven benchmarks. This framework ensures transparent, evidence-grounded clinical decisions.

Why Featured

FaithMed's integration of clinician-designed rubrics with reinforcement learning demonstrates a significant advancement in evidence-based medical reasoning, improving decision-making transparency and accuracy. This development signals to builders and PMs the potential for AI systems that enhance clinical workflows, while investors should note the growing demand for reliable AI solutions in healthcare.

#LLM #Agent #AI Assistant #Enterprise AI

arXiv cs.CL·M. K. Arabov

3d ago

Featured·Original

RusFinChain: A Russian Benchmark for Verifiable Chain-of-Thought Reasoning in Finance with Fuzzy-Aligned Evaluation

AI Summary

RusFinChain is the first Russian-language benchmark for verifiable Chain-of-Thought reasoning in finance, featuring 5,280 examples across 17 domains. Evaluation of 8 open-weight LLMs shows a Hard F1 score of ~0.65 for step alignment, but only ~29% of final answers are correct, highlighting a significant reasoning gap.

Why Featured

The launch of RusFinChain, the first Russian-language benchmark for Chain-of-Thought reasoning in finance, highlights the need for improved reasoning capabilities in financial AI applications. With a Hard F1 score of ~0.65 but only ~29% accuracy in final answers, builders and PMs should prioritize enhancing model performance to meet industry standards, while investors may see opportunities in developing solutions that address these gaps.

#LLM #AI Coding #Policy

arXiv cs.AI·Hongyang He, Jiuming Liu, Victor Sanchez

3d ago

Featured·Original

Revisiting Chain-of-Thought Reasoning under Limited Supervision: Semi-supervised Chain-of-Thought Learning

AI Summary

The paper introduces Semi-CoT, a semi-supervised learning framework leveraging unlabeled questions to generate pseudo reasoning chains for large language models. Experiments on benchmarks like AQuA and GSM8K show pseudo-answer precision between 91.36% and 100%, indicating potential for effective reasoning signal generation, though challenges remain in demonstration selection.

Why Featured

The introduction of the Semi-CoT framework for semi-supervised chain-of-thought learning allows builders and PMs to leverage unlabeled data for improving the reasoning capabilities of language models, potentially reducing the need for extensive labeled datasets. This development could lead to more efficient model training and enhanced performance in applications requiring complex reasoning.

#LLM #AI Coding #Inference #Open Source

arXiv cs.AI·Junyi Wen, Ruiyan Zhuang, Yongjia Xu, Pengtu Li, Rui Zou, Hongyi Chen, Chingman Wan, Puxu Yang, Wuhui Chen, Yanlin Wang

3d ago

Featured·Original

Hawk: Harnessing Hardware-Aware Knowledge for High-Performance NPU Kernel Generation

AI Summary

Hawk is a training-free framework that enhances NPU kernel generation accuracy from 49.4% to 80.0% and achieves up to 2.2x speedup over existing methods by leveraging hardware-aware knowledge through three innovative modules.

Why Featured

Hawk, a training-free framework, raises NPU kernel generation accuracy from 49.4% to 80.0% and delivers up to a 2.2x speedup over existing methods, which matters for builders and PMs focused on optimizing AI hardware performance. Investors should note that this advancement could lead to more efficient AI applications and lower operational costs.

#AI Coding #Robotics #GPU

arXiv cs.AI·Max Van Puyvelde, Halil Ibrahim Gulluk, Wim Van Criekinge, Olivier Gevaert

3d ago

Featured·Original

Discrete Diffusion Language Models for Interactive Radiology Report Drafting

AI Summary

The DiffusionGemma-26B model outperforms its autoregressive counterpart Gemma-4-26B in medical visual question answering, achieving faster decoding and superior drafting capabilities. This diffusion model allows radiologists to infill report fragments bidirectionally, addressing inconsistencies in clinical reports.

Why Featured

The introduction of the DiffusionGemma-26B model, which excels in medical visual question answering and report drafting, signals a shift towards more efficient AI tools in healthcare. For builders and PMs, this means opportunities to integrate advanced AI into clinical workflows, while investors should note the potential for improved accuracy and speed in medical documentation, enhancing overall patient care.

#LLM #AI Coding #Inference #AI Assistant

arXiv cs.CL·Dekun Yang

3d ago

Featured·Original

Prompt Framing Distorts Count-Based Evaluation of LLM Error Detection: Evidence from Numeric Anchoring

AI Summary

This study reveals that count-based F1 scores can inflate without genuine improvements in error detection, highlighting a significant gap termed F1 Inflation. Using ErrorBench, it was found that anchored prompts can inflate F1 scores by up to 0.79 points, suggesting that LLM evaluations should prioritize span-aware metrics over pre-populated error counts.

Why Featured

The study on F1 Inflation in LLM error detection highlights the risk of misleading performance metrics when using count-based evaluations. Builders and PMs should prioritize span-aware metrics for more accurate assessments of model performance, while investors should be cautious about the inflated claims stemming from traditional evaluation methods.

#LLM #AI Coding #Policy

arXiv cs.CL·Prashanna Mani Paudel, Shivanand Venkanna Sheshappanavar

3d ago

Featured·Original

Parameter Golf: What Really Works?

AI Summary

The Parameter Golf challenge tested language model optimization under a strict 16 MB artifact budget. Across 2,037 submissions, bits-per-byte (BPB) fell by 13.6%, from 1.2244 to 1.058. Although individual techniques each showed only minimal improvements, a taxonomy of 84 methods was developed to isolate the strategies that work.
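The 13.6% figure can be checked directly from the two BPB values quoted in the summary. The sketch below is only a sanity check of that arithmetic; the helper name `relative_reduction` is hypothetical and not taken from the paper or the challenge.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline` (hypothetical helper)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# BPB values as quoted in the summary above.
baseline_bpb = 1.2244
best_bpb = 1.058

# Express the reduction as a percentage, rounded to one decimal place.
reduction_pct = round(100 * relative_reduction(baseline_bpb, best_bpb), 1)
print(f"BPB reduction: {reduction_pct}%")
```

Lower BPB is better, so the reduction is measured against the starting value of 1.2244.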

Why Featured

The Parameter Golf challenge cut bits-per-byte by 13.6% under a 16 MB artifact budget, even though individual techniques each contributed only minimal improvements. For builders and PMs, the accompanying taxonomy of 84 methods offers a structured way to improve model efficiency, which can lead to cost savings and better performance in AI applications.

#LLM #AI Coding #Inference

arXiv cs.CL·Ruchao Fan, Yiming Wang, Rui Zhao, Liliang Ren, Keqi Deng, Xiaoyang Chen, Ali Zare, Bo Ren, Yuxuan Hu, Junkun Chen, Yan Huang, Yelong Shen, Jinyu Li

3d ago

Featured·Original

Rethinking Speech-LLM Integration for ASR: Effective Joint Speech-Text Training by Interleaving

AI Summary

The proposed Joint Speech-Text Interleaved Pretraining (JSTIP) enhances ASR performance by interleaving speech-text sequences, achieving improved entity accuracy on 38k hours of data. JSTIP matches domain transcription performance while simplifying adaptation, outperforming traditional ASR and joint training methods, particularly in medical entity recognition.

Why Featured

The development of Joint Speech-Text Interleaved Pretraining (JSTIP) significantly enhances automatic speech recognition (ASR) performance, particularly in specialized fields like medical entity recognition. This advancement implies that builders can create more accurate and adaptable ASR systems, PMs can streamline product development timelines, and investors can identify promising opportunities in AI-driven healthcare solutions.

#LLM #AI Coding #Inference

arXiv cs.CL·Ádám Kovács, Nadia Verdha, Gábor Recski

3d ago

Featured·Original

RuleChef: Grounding LLM Task Knowledge in Human-Editable Rules

AI Summary

RuleChef is an open-source framework that leverages large language models to generate and iteratively improve executable rules for NLP tasks like text classification and Named Entity Recognition. By synthesizing rules from task descriptions and labeled examples, it creates a fast and inspectable rule system, enhancing performance through human feedback and additional examples.

Why Featured

The development of RuleChef, an open-source framework that generates and refines executable rules for NLP tasks, is significant for builders and PMs as it enables faster deployment of robust models through human-editable rules. For investors, this represents a potential shift towards more transparent and adaptable AI systems, which could lead to better performance and reduced operational costs.

#LLM #AI Coding #Open Source

arXiv cs.CL·Samir Abdaljalil, Erchin Serpedin, Hasan Kurban

3d ago

Featured·Original

IsoSci: A Benchmark of Isomorphic Cross-Domain Science Problems for Evaluating Reasoning versus Knowledge Retrieval in LLMs

AI Summary

The ISOSCI benchmark reveals that 91.3% of reasoning-mode gains in LLMs are knowledge-dependent, challenging the assumption that chain-of-thought reasoning enhances scientific problem-solving. Notably, the reasoning-specialized model o3-mini outperformed on […] but underperformed on ISOSCI, indicating that benchmark choice significantly influences conclusions about reasoning utility.

Why Featured

The ISOSCI benchmark reveals that 91.3% of reasoning-mode gains in LLMs depend on knowledge retrieval, challenging the effectiveness of reasoning techniques in scientific problem-solving. This suggests that builders and PMs should prioritize knowledge integration in LLMs, while investors should be cautious about models that emphasize reasoning without robust knowledge bases.

#LLM #AI Coding #Policy

arXiv cs.CL·Tianyi Zhang, Mousumi Das, Abrar Anwar, Jesse Thomason, David Traum

3d ago

Featured·Original

DiPS: Dialogue Policy Selection for High-Stakes Persuasion Agents

AI Summary

The DiPS framework uses Q-learning to dynamically select tailored persuasion strategies in high-stakes scenarios, achieving higher evacuation success rates than zero-shot LLMs and generic […]-augmented methods. Evaluated in fire-rescue contexts, DiPS adapts to individual resident responses, significantly improving outcomes in critical situations.

Why Featured

The DiPS framework represents a significant advancement in AI-driven persuasion strategies, particularly in high-stakes scenarios like fire-rescue operations. For builders and PMs, this technology can enhance user interaction and decision-making in critical applications, while investors should note its potential for improving safety outcomes and operational efficiency in emergency response systems.

#LLM #Agent #Robotics #AI Assistant

arXiv cs.AI·Junyan Tan, Haoran Lin, Siyuan Guo, Yichen Fang, Xinyue Luo, Tianyu Shen, Zeyu Qiao

3d ago

Featured·Original

Safe and Adaptive Cloud Healing: Verifying LLM-Generated Recovery Plans with a Neural-Symbolic World Model

AI Summary

The PASE framework introduces a Planning-Aware Semantic self-healing engine that uses LLMs to generate recovery plans and a Neural-Symbolic World Model to verify them, achieving a reduction of over 40% in recovery time and improved fault detection accuracy in cloud systems.

Why Featured

The PASE framework combines LLM-generated recovery plans with a Neural-Symbolic World Model for verification, reducing recovery time in cloud systems by over 40%. This matters for builders and PMs because it improves operational efficiency and reliability, while investors should note its potential for reducing costs and improving service quality in cloud infrastructure.

#LLM #Inference #AI Startup

arXiv cs.AI·Mahyar Ghazanfari, Amin Tabrizian, Armin Mehrabian, Peng Wei

3d ago

Featured·Original

EO-Agents: A Three-Agent LLM Pipeline for Earth Observation Hypothesis Generation

AI Summary

The EO-Agents pipeline utilizes a three-agent LLM system to generate scientifically grounded hypotheses from NASA's Earth Observation Knowledge Graph, producing 160 hypotheses across various Earth science domains. A factorial experiment reveals stable hypothesis rankings across models GPT-5.2 and Claude Sonnet 4.6, while highlighting the variability in absolute scores based on judge identity.

Why Featured

The EO-Agents pipeline introduces a three-agent LLM system that generates scientifically grounded hypotheses from NASA's Earth Observation Knowledge Graph, demonstrating the potential for AI to enhance research efficiency in Earth sciences. Builders and PMs can leverage this model to develop applications that automate hypothesis generation, while investors should note the growing intersection of AI and environmental research as a promising market.

#LLM #Inference #AI Startup

arXiv cs.CV·Neda Abdolrahimi, Thiru Siddharth, Frank Sicongchen, Vir V Phoha

3d ago

Featured·Original

Sign in the Air to Unlock: An Interface for authentication in Virtual and Augmented Reality Powered by Point-Voxel Cross-Attention Network

AI Summary

The 'Sign in the Air to Unlock' interface utilizes a point-voxel Cross-Attention Network (PV-Net) for 3D signature authentication in VR/AR, achieving a 2.5% Equal Error Rate on the DeepAirSig dataset and 76% accuracy on ImmAirsig, enhancing user-centric security without disrupting immersion.

Why Featured

The 'Sign in the Air to Unlock' interface uses a point-voxel Cross-Attention Network to secure VR/AR applications, reaching a 2.5% Equal Error Rate on the DeepAirSig dataset and 76% accuracy on ImmAirsig. For builders and PMs, it offers user authentication that does not disrupt immersion, which can drive user adoption and investment in immersive technologies.

#Robotics #Security #AI Assistant

arXiv cs.AI·Yuante Li, Yicheng Tao, Kate Zhang, Taozhi Wang, Gefei Gu, Yaxin Zhou

3d ago

Featured·Original

Diverse Evidence, Better Forecasts: Multi-Agent Deliberation Under Information Asymmetry

AI Summary

The InfoDelphi framework enhances multi-agent forecasting by introducing information asymmetry, outperforming traditional models by 12-18% in Brier score on the PolyGym benchmark. This approach allows agents to hold exclusive knowledge, leading to improved accuracy and reduced error correlation, establishing input diversity as crucial for effective reasoning.
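For readers unfamiliar with the metric, the Brier score is the mean squared difference between forecast probabilities and binary outcomes, so lower is better. The sketch below shows the standard definition and one way to express a relative improvement. The forecast values are made up for illustration, and reading the 12-18% figure as a relative reduction is an assumption, not something the summary confirms.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or len(forecasts) == 0:
        raise ValueError("forecasts and outcomes must be non-empty and equal in length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def relative_improvement(baseline: float, candidate: float) -> float:
    """Fractional drop in Brier score relative to the baseline (assumed convention)."""
    return (baseline - candidate) / baseline


# Hypothetical data: four resolved yes/no questions and two sets of forecasts.
outcomes = [1, 0, 1, 0]
baseline_forecasts = [0.6, 0.4, 0.6, 0.4]    # hedged, less informative
candidate_forecasts = [0.8, 0.2, 0.8, 0.2]   # sharper and still calibrated

baseline_brier = brier_score(baseline_forecasts, outcomes)
candidate_brier = brier_score(candidate_forecasts, outcomes)
improvement = relative_improvement(baseline_brier, candidate_brier)
print(f"baseline={baseline_brier:.3f} candidate={candidate_brier:.3f} improvement={improvement:.0%}")
```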

Why Featured

The introduction of the InfoDelphi framework for multi-agent forecasting, which enhances accuracy by leveraging information asymmetry, is significant for builders and PMs as it suggests a new paradigm for developing predictive models. Investors should note that this advancement could lead to more reliable forecasting tools, potentially improving decision-making and risk assessment in various sectors.

#Agent #Inference
