arXiv cs.AI

https://arxiv.org/list/cs.AI/recent

Latest AI signals from arXiv cs.AI

AI research updates from arXiv cs.AI, filtered for agents, planning, reasoning, evaluation and AI systems with readable summaries and signal scores.

DeepSignal tracks AI updates from arXiv cs.AI, filtering research and product signals into plain-English summaries, signal scores and source-linked article pages.

Current topics: Research, LLM, Agent, AI Assistant, AI Coding

High-signal updates

When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models · 78 signal
Why Solve It Twice? Hierarchical Accumulation of Skills for Transfer-Efficient ML Engineering · 78 signal
Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics · 78 signal

arXiv cs.AI·Ramin Pishehvar

10h ago

Featured · Original

A Three-Phase Foundation Model for Tax-Aware Personalized Portfolio Management

AI Summary

This paper introduces a three-phase deep reinforcement learning model for personalized portfolio management, addressing ticker lock-in, monolithic objectives, and static user models. It employs a T5-based time series model for asset encoding, a Mixture of Experts architecture for diverse investment goals, and a personalized inference layer using transaction history, marking a significant advancement in financial AI applications.

Why Featured

The introduction of a three-phase deep reinforcement learning model for personalized portfolio management represents a significant advancement in financial AI, allowing for more tailored investment strategies that adapt to individual user behaviors and goals. This could lead to improved investment performance and customer satisfaction, making it a critical development for builders and PMs in the fintech space, as well as for investors seeking more effective portfolio management tools.

#LLM #Inference #AI Assistant #Enterprise AI

2

arXiv cs.AI·Bartłomiej Cupiał, Jan Łojek, Mikołaj Garstecki, Szymon Pobłocki, Alicja Ziarko, Piotr Miłoś

10h ago

Original

What Drives Interactive Improvement from Feedback?

AI Summary

The study reveals that multi-turn language agents show limited improvement from self-generated feedback compared to strong external feedback, emphasizing the importance of the student's ability to act on feedback. The controlled evaluation across benchmarks such as Omni-MATH and Codeforces indicates that feedback must provide specific guidance to enhance performance effectively.

Why Featured

The study highlights that multi-turn language agents benefit more from strong external feedback than from self-generated feedback, indicating that builders and PMs should prioritize developing systems that can provide specific, actionable guidance. For investors, this suggests that products focused on enhancing feedback mechanisms may have a competitive edge in improving AI performance.

#LLM #Agent #AI Assistant

0

arXiv cs.AI·Zhe Dong (University of Maine at Presque Isle), Fang Qin (Stanford University), Manish Shah (Independent Researcher)

10h ago

Featured · Original

When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models

AI Summary

LearnStop, a checkpoint stopper for reasoning models, shows task-dependent benefits in early exits. In free-form math tasks like GSM8K with Qwen3-32B, it achieves a +0.157 peak adapt gain, outperforming scalar exits, while scalar rules remain competitive in multiple-choice settings.

Why Featured

The introduction of LearnStop, a checkpoint stopper for reasoning models, highlights the importance of task-dependent strategies in AI performance. Builders and PMs should consider integrating such adaptive mechanisms to optimize model efficiency and effectiveness, while investors may find opportunities in technologies that enhance AI reasoning capabilities, leading to better outcomes in diverse applications.

#LLM #AI Coding #Inference

0

arXiv cs.AI·Derek Koh, Jinghui Mo, Benjamin H. Le, Jiening Zhan, Baofen Zheng, Kevin Bevis, Nathaniel C. Owen, Lauren Elizabeth Charney, Wenqiong Liu, Jingwei Wu

10h ago

Featured · Original

Contrastive Reflection for Iterative Prompt Optimization

AI Summary

The Contrastive Reflection framework enhances iterative prompt optimization for LLM agents in information retrieval, improving exact-match accuracy from 51.4% to 60.4% on HotpotQA. By leveraging error-anchored behavioral slices and targeted prompt edits, it ensures validation-driven improvements without regressions, outperforming other methods like MIPROv2 and GEPA.

Why Featured

The development of the Contrastive Reflection framework significantly improves iterative prompt optimization for LLM agents, increasing exact-match accuracy on HotpotQA from 51.4% to 60.4%. This advancement offers builders and PMs a more effective method for enhancing AI performance in information retrieval tasks, which can lead to better user experiences and more reliable applications, attracting investor interest in improved AI capabilities.

#LLM #AI Search #AI Assistant

0

arXiv cs.AI·Ankur Samanta, Akshayaa Magesh, Tal Lancewicki, Ayush Jain, Youliang Yu, Paul Sajda, Kaveh Hassani, Aditya Modi, Daniel R. Jiang, Yonathan Efroni

10h ago

Featured · Original

BayesBench: Evaluating LLM Belief Trajectories Under Multi-Turn Evidence Accumulation

AI Summary

BayesBench evaluates LLMs' belief updates in multi-turn conversations, revealing that while scaling improves latent inference, it doesn't consistently enhance downstream predictions. The study assesses seven LLMs (3B-70B) across Bayesian tasks, highlighting a gap between inferring latent structures and rational belief updates.

Why Featured

The development of BayesBench provides a framework for evaluating how large language models (LLMs) update their beliefs during multi-turn conversations, highlighting the limitations of scaling in improving predictive accuracy. This insight is crucial for builders and PMs to refine LLM applications, ensuring they can effectively manage user interactions and expectations in real-world scenarios.

#LLM #Inference

0

arXiv cs.AI·Cor Steging, Ludi van Leeuwen, Tadeusz Zbiegień

10h ago

Featured · Original

Investigating Multi-Agent Deliberation in Law

AI Summary

This study explores multi-agent deliberation methods for legal reasoning using Large Language Models (LLMs), revealing that these frameworks can outperform traditional models in specific scenarios. The introduced frameworks, inspired by courtroom procedures, demonstrate comparable performance to baseline LLMs while addressing unique legal cases. The findings suggest that multi-agent systems could significantly enhance AI applications in the legal domain.

Why Featured

The study on multi-agent deliberation in legal reasoning demonstrates that these frameworks can outperform traditional models in specific scenarios, indicating a potential shift in how AI can be applied in the legal domain. Builders and PMs should consider integrating multi-agent systems into their legal tech solutions, while investors may see opportunities in startups leveraging this advanced approach to enhance legal decision-making processes.

#LLM #Agent #AI Assistant #Policy

0

arXiv cs.AI·Anish Acharya, Kris W Pan, Brian Verkhovsky

10h ago

Featured · Original

RoPoLL: Robust Panel of LLM Judges

AI Summary

RoPoLL, a robust panel of LLM judges, outperforms traditional LLM jury methods by mitigating bias from individual judges, achieving a 19% improvement on cross-dimensional attacks and significantly outperforming Mistral-Large-3 in specific corruption scenarios. It utilizes a geometric median for aggregation, ensuring optimal performance against up to 50% corruption rates.
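To make the aggregation idea concrete, here is a minimal sketch of a geometric median computed with the standard Weiszfeld iteration. It is not RoPoLL's implementation: the panel, the two rubric dimensions, and the scores are hypothetical, and the summary does not say how the paper computes the median.

```python
import math

def geometric_median(points, iters=200, eps=1e-9):
    """Weiszfeld iteration for the geometric median of equal-length score vectors."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    est = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            # Each point is weighted by the inverse of its distance to the estimate.
            w = 1.0 / max(math.dist(p, est), eps)  # eps guards against division by zero
            den += w
            for i in range(dim):
                num[i] += w * p[i]
        new = [x / den for x in num]
        converged = math.dist(new, est) < eps
        est = new
        if converged:
            break
    return est

# Hypothetical panel: five judges score one answer on two rubric dimensions.
honest = [[7.0, 8.0], [7.5, 8.0], [7.0, 7.5]]
corrupted = [[0.0, 0.0], [10.0, 0.0]]  # two judges return adversarial scores
panel = honest + corrupted

mean = [sum(p[i] for p in panel) / len(panel) for i in range(2)]
median = geometric_median(panel)

print("mean:  ", mean)
print("median:", [round(x, 2) for x in median])
```

In this toy panel the coordinate-wise mean is dragged well below the honest judges' scores on the second dimension, while the geometric median stays inside the honest cluster.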

Why Featured

The development of RoPoLL, a robust panel of LLM judges, is significant as it demonstrates a 19% improvement on cross-dimensional attacks while mitigating bias from individual judges. This advancement provides builders and PMs with a more reliable framework for evaluating AI performance, while investors can recognize the potential for more trustworthy AI applications in critical sectors.

#LLM #AI Assistant

0

arXiv cs.AI·Yongbin Kim, Yashar Talebirad, Osmar R. Zaiane

10h ago

Featured · Original

Why Solve It Twice? Hierarchical Accumulation of Skills for Transfer-Efficient ML Engineering

AI Summary

HASTE, a hierarchical multi-agent system for ML engineering, organizes knowledge into three tiers, achieving a 100% medal rate with tiered loading compared to 62.5% with flat loading. In 22 Kaggle competitions, it reached a 77.3% medal rate using Claude Sonnet 4.6, demonstrating that better knowledge organization can enhance performance while reducing compute costs.

Why Featured

The development of HASTE, a hierarchical multi-agent system for ML engineering, demonstrates that organizing knowledge into structured tiers can significantly enhance performance in machine learning competitions while reducing compute costs. This signals to builders and PMs the importance of knowledge management in AI projects, and to investors, it highlights a promising approach for more efficient and effective ML solutions.

#Agent #AI Coding #Inference

3

arXiv cs.AI·Nicolaie Popescu-Bodorin, Madeleine Togher

10h ago

Featured · Original

Neuro-Bayesian-Symbolic Residual Attention Shallow Network: Explainable Deep Learning for Cybersecurity Risk Assessment

AI Summary

The Neuro-Bayesian-Symbolic Residual Attention Shallow Network (NBS-RASN) offers a novel approach to explainable cybersecurity risk assessment, achieving confidence scores between 0.79 and 0.97 across 20 open-source projects. This shallow network incorporates domain knowledge and causal reasoning, proving that interpretability can coexist with performance, challenging the notion that deep models are necessary for effective learning in high-stakes environments.

Why Featured

The development of the Neuro-Bayesian-Symbolic Residual Attention Shallow Network (NBS-RASN) demonstrates that effective cybersecurity risk assessment can be achieved with explainable models, challenging the reliance on complex deep learning systems. This has practical implications for builders and PMs in creating more interpretable solutions, while investors may see opportunities in companies leveraging such innovative approaches to enhance security without sacrificing transparency.

#Open Source #Security #AI Assistant

0

arXiv cs.AI·Jingpu Yang, Fengxian Ji, Zhengzhao Lai, Zhexuan Cui, Guangxian Ouyang, Qian Jiang, Fan Zhang, Min Peng, Qianqian Xie, Preslav Nakov, Zhuohan Xie

10h ago

Original

LabGuard: Grounding Natural-Language Laboratory Rules into Runtime Guards for Embodied Laboratory Agents

AI Summary

LabGuard introduces a safety suite that translates natural-language laboratory rules into executable specifications, reducing unsafe events from 39.5% to 23.8%. With a task-scope F1 score of 79.4, it effectively integrates runtime monitors in dynamic lab environments, maintaining intervention rates below 0.5%.
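The drop in unsafe events from 39.5% to 23.8% can be read two ways, and the short calculation below separates them. It uses only the two rates quoted in the summary; the variable names are illustrative.

```python
before, after = 39.5, 23.8  # unsafe-event rates in percent, from the summary

# Absolute change, measured in percentage points.
absolute_drop = round(before - after, 1)

# Relative change, measured as a share of the baseline rate.
relative_drop = round(100 * (before - after) / before, 1)

print(f"absolute: {absolute_drop} percentage points")
print(f"relative: {relative_drop}% of the baseline rate")
```

The absolute figure is 15.7 percentage points; relative to the 39.5% baseline, that is a reduction of about 39.7%.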

Why Featured

LabGuard's ability to translate natural-language laboratory rules into executable specifications significantly enhances safety in dynamic lab environments, reducing unsafe events by 15.7 percentage points (from 39.5% to 23.8%). This development is crucial for builders and PMs focused on safety compliance in robotics, while investors may see potential in scalable applications across various industries requiring automated safety protocols.

#Agent #Robotics #AI Assistant

1

arXiv cs.AI·Ke Zhang, Patricio Gallardo Candela, Sudhir Murthy, Yi Xie, Zhi Wang, Maziar Raissi

10h ago

Original

Beyond Compilation: Evaluating Faithful Natural-Language-to-Lean Statement Formalization

AI Summary

This study evaluates natural-language-to-Lean formalization, revealing a 29.0-point gap between compilation success (89.5%) and consensus faithfulness (60.5%). The findings suggest that existing models struggle with faithful statement generation, emphasizing the need for separate reporting of formal validity and proof-oriented competence.
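A small illustration of why compilation success and faithfulness can diverge: both Lean 4 definitions below type-check, but only the first says what the informal sentence says. The example is constructed for this page and is not taken from the paper's benchmark.

```lean
-- Informal statement: "Every natural number is at most its own square."

-- Faithful formalization: universally quantified, as the sentence says.
def faithful : Prop := ∀ n : Nat, n ≤ n * n

-- Also type-checks, but only claims that *some* such number exists,
-- so it is not faithful to the informal statement.
def unfaithful : Prop := ∃ n : Nat, n ≤ n * n
```

A compile-only metric would count both definitions as successes, which is the kind of gap the 89.5% versus 60.5% figures describe.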

Why Featured

The study highlights a significant 29.0-point gap between successful compilation and faithfulness in natural-language-to-Lean formalization, indicating that current AI models may not reliably generate accurate formal statements. Builders and PMs should focus on improving model training for better fidelity in outputs, while investors should consider the implications for the reliability of AI applications in formal verification tasks.

#LLM #AI Coding

0

arXiv cs.AI·Zhenqian Shen, Yu Liu, Xiaoyi Fu, Quanming Yao

10h ago

Featured · Original

DDIAgents: Mechanism-Conditioned Context Flow for Drug-Drug Interaction Prediction

AI Summary

DDIAgents introduces a mechanism-conditioned framework for drug-drug interaction (DDI) prediction, enhancing interpretability and performance. It outperforms traditional models across various benchmarks by reducing irrelevant information and leveraging expert reasoning. This approach showcases the potential of multi-agent systems in organizing heterogeneous biomedical knowledge for adaptive AI4Science applications.

Why Featured

The introduction of DDIAgents for drug-drug interaction prediction highlights the effectiveness of multi-agent systems in biomedical applications, offering improved interpretability and performance. This development signals a shift towards more adaptive AI solutions in healthcare, which could attract investment and drive innovation in drug discovery and safety monitoring.

#Agent #AI Coding #AI Startup

1

arXiv cs.AI·Irena Saracay, Ludwig Schmidt, Carlos Guestrin

10h ago

Original

Beyond expert users: agents should help users construct preferences, not just elicit them

AI Summary

The study critiques the assumption that users have well-defined preferences, proposing CoPref and CoShop to help users construct preferences through agent interactions. Of the five models evaluated, none exceeded 56% accuracy, highlighting the need for agents to enhance user knowledge rather than just retrieve items.

Why Featured

The development of CoPref and CoShop highlights the need for AI agents to assist users in constructing their preferences rather than merely retrieving items based on assumed preferences. This suggests that builders and PMs should focus on enhancing user engagement and knowledge through interactive AI, while investors should consider the potential for improved personalization in AI products.

#Agent #AI Assistant

0

arXiv cs.AI·Arshia Soltani Moakhar, Iman Gholami, Max Springer, Mahdi JafariRaviz, MohammadTaghi Hajiaghayi

10h ago

Featured · Original

Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics

AI Summary

This paper introduces an agentic framework for autoformalizing research mathematics using general coding LLMs, outperforming smaller models in Lean 4. The system dynamically extends type definitions and validates them before formalizing theorems, successfully producing machine-checked proofs for 32 PutnamBench problems and five ACM STOC papers.

Why Featured

The introduction of an agentic framework for autoformalizing research mathematics using general coding LLMs signifies a major advancement in automating theorem proving, which can enhance the efficiency of mathematical research and validation processes. For builders and PMs, this development opens opportunities to integrate advanced AI tools into academic and research applications, while investors may see potential for commercialization in educational and AI-driven research platforms.

#LLM #Agent #AI Coding

2

arXiv cs.AI·Jhon G. Botello, Jose J. Padilla, Erika Frydenlund, Krzysztof Rechowicz, Eric Weisel

10h ago

Featured · Original

How Can AI Find My Model? A Model-Finding Experimental Study Considering Data Formats, Embeddings, and Retrieval Strategies

AI Summary

This study explores how data representation, transformer-based embeddings, and retrieval strategies impact the discovery of simulation models through natural language queries. Results indicate that open-source embedding models perform well, and reranking methods are crucial as query complexity increases, providing a baseline for AI-driven model discovery.

Why Featured

This study highlights the effectiveness of open-source embedding models and the importance of reranking methods in AI-driven model discovery. For builders and PMs, this means they can leverage these insights to improve model retrieval systems, enhancing user experience and efficiency, while investors should note the potential for scalable solutions in AI applications across various industries.

#Inference #Open Source #AI Search

0

arXiv cs.AI·Anuj Kaul, Qianlong Lan, Pranay Gupta

10h ago

Featured · Original

AgentBound: Verifiable Behavioral Governance for Autonomous AI Agents

AI Summary

AgentBound introduces a runtime governance framework for autonomous AI agents, ensuring verifiable behavioral oversight through delegated authorization, owner-signed constitutions, and site action contracts. It generates cryptographically verifiable governance receipts, enhancing accountability and allowing independent verification of actions while supporting long-running agents with refreshed governance policies.

Why Featured

AgentBound's introduction of a runtime governance framework for autonomous AI agents allows builders and PMs to implement verifiable oversight mechanisms, enhancing accountability and trust in AI systems. For investors, this development signals a move towards more responsible AI deployment, potentially reducing regulatory risks and increasing the attractiveness of AI solutions in the market.

#Agent #Security #Policy

0

arXiv cs.AI·Zihan Chen, Songwei Dong, Chengshuai Shi, Peng Wang, Song Wang, Cong Shen, Jundong Li

10h ago

Featured · Original

The Past Is Prologue: A Plug-in Controller for Selective Updates in Sequentially Evolving LLM Memory

AI Summary

Janus is a plug-in memory controller for LLMs that selectively updates memory, improving accuracy by 2.7 to 4.6 points across various datasets. It uses a Memory Momentum Trigger to evaluate updates efficiently, preventing the loss of useful knowledge. This method is agnostic to existing updaters, enhancing performance without altering their rules.

Why Featured

The development of Janus, a plug-in memory controller for LLMs, matters because it enhances the accuracy of memory updates by 2.7 to 4.6 points without changing existing systems. This improvement allows builders and PMs to create more reliable AI applications, while investors can recognize its potential for increasing the value of AI products in the market.

#LLM #AI Coding

0

arXiv cs.AI·Anjali Parashar, Chuchu Fan

10h ago

Featured · Original

Scenario Generation for Testing of Autonomous Driving Systems Using Real-World Failure Records

AI Summary

This study introduces a novel scenario generation pipeline for Autonomous Driving Systems (ADS) testing, leveraging historical failure records in natural language. Using modular LLM-based synthetic scenario generation, the method produces diverse scenarios compatible with testing constraints. The authors apply it to NHTSA ADS crash data to generate 20 scenarios for the MetaDrive simulator.

Why Featured

The introduction of a novel scenario generation pipeline for Autonomous Driving Systems, which utilizes historical failure records to create diverse testing scenarios, is significant for builders and PMs as it enhances the robustness of ADS testing. For investors, this development indicates a potential reduction in testing costs and time, improving the viability of autonomous vehicle technologies in the market.

#LLM #Robotics #AI Assistant

0

arXiv cs.AI·Atsushi Masumori, Itsuki Doi, Norihiro Maruyama, Ryosuke Takata, Takashi Ikegami

10h ago

Featured · Original

OpenLife: Toward Open-World Artificial Life with Autonomous LLM Agents

AI Summary

OpenLife introduces open-world Artificial Life (ALIFE) using autonomous LLM agents with persistent memory and social dynamics, demonstrating emergent behaviors over twelve weeks. The project showcases a shift from reactive to spontaneous activities and the formation of distinct agents with their own income, marking a significant step toward living AI.

Why Featured

The development of OpenLife's autonomous LLM agents with persistent memory signifies a major advancement in creating AI that can exhibit emergent behaviors and social dynamics. This has practical implications for builders and PMs in designing more interactive and adaptive systems, while investors may see potential in applications across gaming, simulation, and AI-driven social platforms.

#LLM #Agent #AI Startup

0

arXiv cs.AI·Arshia Rafieioskouei, Tzu-Han Hsu, Matthew Lucas, Borzoo Bonakdarpour

10h ago

Featured · Original

HyPOLE: Hyperproperty-Guided Multi-Agent Reinforcement Learning under Partial Observation

AI Summary

HyPOLE introduces a novel framework for Multi-Agent Reinforcement Learning (MARL) under partial observability, leveraging hyperproperties and HyperLTL for guidance. Evaluations on SMAC, MessySMAC, and WildFire benchmarks show significant performance improvements over traditional methods, demonstrating the effectiveness of Centralized Training for Decentralized Execution (CTDE) techniques in synthesizing decentralized policies.

Why Featured

The introduction of HyPOLE, a framework for Multi-Agent Reinforcement Learning (MARL) that utilizes hyperproperties for guidance, signifies a substantial advancement in developing decentralized policies under partial observability. This can enhance the efficiency and effectiveness of AI systems in complex environments, making it a critical consideration for builders and investors focused on scalable AI solutions.

#Agent #AI Coding

0

arXiv cs.AI·Huaze Tang, Bill Zeng, Chao Wang, Zhenpeng Shi, Qian Zhang, Wenbo Ding

10h ago

Featured · Original

Revealing Safety-Critical Scenarios for UTM via Transformer

AI Summary

This study presents a transformer-based reinforcement learning approach for identifying vulnerabilities in Unmanned Traffic Management (UTM) systems, achieving an 8x improvement in discovery efficiency over expert-guided testing. The proposed framework utilizes attention mechanisms to model system states and generate targeted test scenarios, effectively uncovering critical edge cases missed by traditional methods.

Why Featured

The development of a transformer-based reinforcement learning approach for identifying vulnerabilities in Unmanned Traffic Management systems significantly enhances testing efficiency, achieving an 8x improvement. This is crucial for builders and PMs focused on safety and reliability in autonomous systems, while investors should note the potential for reduced costs and faster deployment of safer UTM solutions.

#AI Coding #Inference #Robotics

0

arXiv cs.AI·Sheng Zhang, Qinglin Li, Yuechao Zang, Xueqin Huang, Yijia Fu, Cheng Zhu

10h ago

Featured · Original

MultiUAV-Plat: An LLM-Oriented Platform, Benchmark and Framework for Multi-UAV Collaborative Task Planning

AI Summary

MultiUAV-Plat introduces a lightweight platform for multi-UAV collaborative task planning, featuring 75 mission sessions and 1500 tasks. The Agent4Drone framework outperforms a ReAct baseline with a 57.9% task pass rate, significantly enhancing LLM-driven UAV autonomy under realistic constraints.

Why Featured

The development of the MultiUAV-Plat platform enhances LLM-driven UAV autonomy, achieving a 57.9% task pass rate in collaborative planning. This improvement signals a significant advancement in multi-UAV applications, presenting opportunities for builders and PMs to develop more efficient drone solutions, while investors may see potential in the growing UAV market.

#LLM #Agent #Robotics

2

arXiv cs.AI·Yang Zou, Zijian Ding, Yizhou Sun, Jason Cong

10h ago

Featured · Original

AgRefactor: Self-Evolving Agentic Workflow for HLS Compatibility and Performance

AI Summary

AgRefactor is an LLM-based workflow that refactors software into HLS-compatible code, achieving a 6.51x speedup over state-of-the-art tools on complex benchmarks. It utilizes a self-evolving memory system to enhance efficiency and scalability, outperforming existing methods on 9 out of 11 challenging real-world cases. Fully automated and open-sourced, it addresses the gap between software and hardware programming practices.

Why Featured

AgRefactor's self-evolving multi-agent workflow can significantly streamline the process of converting software to HLS-compatible code, offering a 6.51x speedup over existing tools. This development is crucial for builders and PMs looking to optimize performance in hardware-software integration, while investors should note its potential to disrupt the software development landscape.

#LLM #Agent #AI Coding #Open Source

2

arXiv cs.AI·Maria Xenochristou, Ashutosh Joshi, Korosh Vatanparvar, Mohammad Abuzar Hashemi, Prasad Kasu, Deepak Bansal, Anchal Nema, Nivedita Wadhwa, Prashams S Jain, Rebecca Abraham, Will Kimbrough, Dilek Hakkani-Tur, Wilko Schulz-Mahlendorf

1d ago

Original

IMCBench: A benchmark for multimodal LLMs in Image-grounded Medical Conversations

AI Summary

IMCBench introduces a novel benchmark for multimodal large language models (LLMs) in medical conversations, pairing clinical images with synthetic patient profiles. The evaluation of eight models reveals that while Claude Opus 4.6 scores highest overall (3.61), safety concerns persist, particularly for malignant and rare conditions, highlighting the need for multi-dimensional assessment frameworks in medical AI.

Why Featured

The introduction of IMCBench for evaluating multimodal LLMs in medical conversations is significant as it highlights the need for robust assessment frameworks to address safety concerns in AI applications. Builders and PMs should consider integrating such benchmarks to ensure reliability in healthcare AI, while investors may see opportunities in companies that prioritize safety and efficacy in their AI solutions.

#LLM #AI Image #AI Assistant #Policy

0

arXiv cs.AI·Shahnewaz Karim Sakib, Anindya Bijoy Das

1d ago

Featured · Original

Memory as an Attack Surface in LLM Agents: A Study on Multiple-Choice Question Answering

AI Summary

This study investigates the vulnerability of LLM-based agents, particularly in multiple-choice question answering, due to memory manipulation. By implementing an external memory component, the research demonstrates that even simple corruptions can significantly alter the agent's responses, leading to incorrect selections despite clean queries. The findings highlight the need for robust memory management in AI systems to mitigate these risks.

Why Featured

The study on memory vulnerabilities in LLM agents reveals that external memory manipulation can lead to incorrect responses in multiple-choice question answering. This underscores the critical need for builders and PMs to prioritize robust memory management in AI systems, as investors should consider the implications for reliability and trustworthiness in AI applications.

#LLM #Agent #Security

0

arXiv cs.AI·Zhixuan Li, Jiangan Yuan, Han Xu

1d ago

Original

Data and Evaluation Closed-Loop for Model Capability Enhancement

AI Summary

The study introduces the 'capability slice' to bridge the gap between model evaluation and data optimization, demonstrating its effectiveness in two case studies. In one, targeted data intervention improved BBH performance by 66.44% without altering the dataset, while in another, a focused sampling strategy enhanced math-reasoning scores from 0.00 to 26.67.

Why Featured

The introduction of the 'capability slice' for model evaluation and data optimization is significant as it demonstrates a way to enhance model performance dramatically without the need for extensive data changes. Builders and PMs can leverage this approach to improve their AI models efficiently, while investors may see it as a signal of advancing methodologies that reduce costs and time in model development.

#LLM #AI Coding #Inference

0

arXiv cs.AI·Michael Nguyen, Quoc Nguyen, Paul Vuong

1d ago

Original

Recursive Self-Evolving Agents via Held-Out Selection

AI Summary

The Recursive Self-Evolving Agent (RSEA) outperforms existing methods like ReAct on the ALFWorld benchmark, achieving 69.3% accuracy, and 79.4% with retries. RSEA's strict held-out selection ensures it never underperforms compared to the base agent, while unguarded context evolution proves to be high-variance and unsafe across tasks. Overall, no single artifact consistently excels across benchmarks.
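The held-out guard described above can be sketched in a few lines. This is an illustrative reading of "strict held-out selection", not the paper's code: the function name, the dictionary-based toy agents, and the scoring function are all hypothetical.

```python
def select_with_held_out_guard(base, candidates, score, held_out):
    """Return a candidate only if it strictly beats the base agent on held-out tasks."""
    best, best_score = base, score(base, held_out)
    for cand in candidates:
        s = score(cand, held_out)
        if s > best_score:  # strict improvement required; ties keep the incumbent
            best, best_score = cand, s
    return best, best_score

# Toy setup (hypothetical): an "agent" is a mapping from task id to its answer.
held_out = {"t1": "a", "t2": "b", "t3": "c", "t4": "d"}  # task -> correct answer

def score(agent, tasks):
    """Fraction of held-out tasks the agent answers correctly."""
    return sum(agent.get(t) == ans for t, ans in tasks.items()) / len(tasks)

base = {"t1": "a", "t2": "b", "t3": "x", "t4": "x"}    # 2 of 4 correct
worse = {"t1": "a", "t2": "x", "t3": "x", "t4": "x"}   # 1 of 4 correct
better = {"t1": "a", "t2": "b", "t3": "c", "t4": "x"}  # 3 of 4 correct

chosen, chosen_score = select_with_held_out_guard(base, [worse, better], score, held_out)
only_worse, _ = select_with_held_out_guard(base, [worse], score, held_out)
```

Because the base agent is the default, the selected agent's held-out score can never fall below the base agent's; when every evolved candidate is worse, the base agent is returned unchanged. The guarantee holds on the held-out set only.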

Why Featured

The development of Recursive Self-Evolving Agents (RSEA), which outperforms existing methods on the ALFWorld benchmark, highlights the importance of robust selection mechanisms in AI. Builders and PMs should consider integrating such adaptive models to enhance performance while managing risks associated with unguarded evolution, which can lead to inconsistencies across tasks.

#Agent #Inference #AI Assistant

0

arXiv cs.AI·Shanghua Gao, Ayush Noori, Richard Zhu, Curtis Ginder, Zhenglun Kong, Xiaorui Su, Justin Kauffman, Benjamin S. Glicksberg, Joshua Lampert, Ankit Sakhuja, Ashwin Sawant, ATHENA-R1 Evaluation Consortium, David A. Clifton, Noa Dagan, Ran Balicer, Marinka Zitnik

1d ago

Featured · Original

An AI agent for treatment reasoning over a biomedical tool universe

AI Summary

ATHENA-R1 is an AI agent for treatment reasoning, outperforming existing models with 94.7% accuracy in drug reasoning and 82.9% in treatment reasoning. Trained using reinforcement learning across 3,168 drug tasks and 456 patient cases, it shows significant improvements over GPT-5 by 17.8 and 10.7 points respectively.

Why Featured

The development of ATHENA-R1, an AI agent achieving 94.7% accuracy in drug reasoning, represents a significant leap in biomedical AI applications. This advancement can lead to more effective treatment plans, making it a critical consideration for builders and PMs in healthcare tech, while investors may find opportunities in the growing market for AI-driven medical solutions.

#Agent #AI Coding #Inference #AI Startup

0

arXiv cs.AI·David Courtis, Ting Hu

1d ago

Original

Mechanistic Personality Analysis of LLMs: Steering Personality via Latent Feature Interventions

AI Summary

This study introduces a mechanistic interpretability approach for Large Language Models (LLMs) that enhances OCEAN personality traits through latent feature interventions. By using sparse autoencoders and contrastive activation analysis, the method applies targeted shifts in hidden states, achieving improved personality control while maintaining high performance on standard benchmarks.

Why Featured

The introduction of a mechanistic interpretability approach for LLMs that enhances OCEAN personality traits through latent feature interventions is significant for builders and PMs as it provides a method to create more tailored and engaging AI interactions. For investors, this development signals a potential for improved user experience and retention in AI applications, which could lead to increased market competitiveness.

#LLM #AI Coding

0

arXiv cs.AI·Simrita Singh, Naireet Ghosh, Tinglong Dai

1d ago

Original

Managing the Human Fallback: Skill Investment Under Improving AI and Worker Mobility

AI Summary

This study presents a two-period model analyzing how firms should balance AI deployment and worker engagement, revealing that engaging less-skilled workers is cost-effective for fallback purposes, while worker mobility shifts investment towards higher-skilled workers. The model highlights the dual dimensions of AI progress—capability and reliability—and their impact on future skill development.

Why Featured

The study's model highlights the importance of balancing AI deployment with worker engagement, suggesting that firms should invest in both less-skilled workers for fallback and higher-skilled workers due to mobility. This insight is crucial for builders and PMs in workforce planning and for investors in assessing the long-term viability of companies adapting to AI advancements.

#AI Assistant #Policy

0
