https://arxiv.org/list/cs.AI/recent
DeepSignal tracks AI research updates from arXiv cs.AI, filtered for agents, planning, reasoning, evaluation and AI systems, and turns research and product signals into plain-English summaries, signal scores and source-linked article pages.
Current topics: Research, LLM, Agent, AI Assistant, AI Coding
High-signal updates
This paper introduces a three-phase deep reinforcement learning model for personalized portfolio management, addressing ticker lock-in, monolithic objectives, and static user models. It employs a T5-based time series model for asset encoding, a Mixture of Experts architecture for diverse investment goals, and a personalized inference layer using transaction history, marking a significant advancement in financial AI applications.
The introduction of a three-phase deep reinforcement learning model for personalized portfolio management represents a significant advancement in financial AI, allowing for more tailored investment strategies that adapt to individual user behaviors and goals. This could lead to improved investment performance and customer satisfaction, making it a critical development for builders and PMs in the fintech space, as well as for investors seeking more effective portfolio management tools.
The study reveals that multi-turn language agents show limited improvement from self-generated feedback compared to strong external feedback, emphasizing the importance of the student's ability to act on feedback. The controlled evaluation across benchmarks such as Omni-MATH and Codeforces indicates that feedback must provide specific guidance to improve performance.
The study highlights that multi-turn language agents benefit more from strong external feedback than from self-generated feedback, indicating that builders and PMs should prioritize developing systems that can provide specific, actionable guidance. For investors, this suggests that products focused on enhancing feedback mechanisms may have a competitive edge in improving AI performance.
LearnStop, a checkpoint stopper for reasoning models, shows task-dependent benefits in early exits. In free-form math tasks like GSM8K with Qwen3-32B, it achieves a +0.157 peak adapt gain, outperforming scalar exits, while scalar rules remain competitive in multiple-choice settings.
The introduction of LearnStop, a checkpoint stopper for reasoning models, highlights the importance of task-dependent strategies in AI performance. Builders and PMs should consider integrating such adaptive mechanisms to optimize model efficiency and effectiveness, while investors may find opportunities in technologies that enhance AI reasoning capabilities, leading to better outcomes in diverse applications.
The Contrastive Reflection framework enhances iterative prompt optimization for LLM agents in information retrieval, improving exact-match accuracy from 51.4% to 60.4% on HotpotQA. By leveraging error-anchored behavioral slices and targeted prompt edits, it ensures validation-driven improvements without regressions, outperforming other methods like MIPROv2 and GEPA.
The development of the Contrastive Reflection framework significantly improves iterative prompt optimization for LLM agents, increasing exact-match accuracy on HotpotQA from 51.4% to 60.4%. This advancement offers builders and PMs a more effective method for enhancing AI performance in information retrieval tasks, which can lead to better user experiences and more reliable applications, attracting investor interest in improved AI capabilities.
BayesBench evaluates LLMs' belief updates in multi-turn conversations, revealing that while scaling improves latent inference, it doesn't consistently enhance downstream predictions. The study assesses seven LLMs (3B-70B) across Bayesian tasks, highlighting a gap between inferring latent structures and rational belief updates.
The development of BayesBench provides a framework for evaluating how large language models (LLMs) update their beliefs during multi-turn conversations, highlighting the limitations of scaling in improving predictive accuracy. This insight is crucial for builders and PMs to refine LLM applications, ensuring they can effectively manage user interactions and expectations in real-world scenarios.
This study explores multi-agent deliberation methods for legal reasoning using Large Language Models (LLMs), revealing that these frameworks can outperform traditional models in specific scenarios. The introduced frameworks, inspired by courtroom procedures, demonstrate comparable performance to baseline LLMs while addressing unique legal cases. The findings suggest that multi-agent systems could significantly enhance AI applications in the legal domain.
The study on multi-agent deliberation in legal reasoning demonstrates that these frameworks can outperform traditional models in specific scenarios, indicating a potential shift in how AI can be applied in the legal domain. Builders and PMs should consider integrating multi-agent systems into their legal tech solutions, while investors may see opportunities in startups leveraging this advanced approach to enhance legal decision-making processes.
RoPoLL, a robust panel of LLM judges, outperforms traditional LLM jury methods by mitigating bias from individual judges, achieving a 19% improvement on cross-dimensional attacks and significantly outperforming Mistral-Large-3 in specific corruption scenarios. It utilizes a geometric median for aggregation, ensuring optimal performance against up to 50% corruption rates.
The development of RoPoLL, a robust panel of LLM judges, is significant because it demonstrates a 19% improvement on cross-dimensional attacks by limiting the influence of biased or corrupted individual judges. This gives builders and PMs a more reliable framework for evaluating AI performance, while investors can recognize the potential for more trustworthy AI applications in critical sectors.
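To make the aggregation step concrete, here is a minimal Python sketch of a geometric median computed with the Weiszfeld iteration. It is an illustration under stated assumptions, not RoPoLL's implementation: the function names, the two-dimensional score vectors and the example panel of five judges (two of them corrupted) are all hypothetical.

```python
import math


def geometric_median(points, iters=200, eps=1e-9):
    """Approximate the geometric median of equal-length vectors (Weiszfeld iteration)."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    est = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            # Weight each point by inverse distance; eps guards against division by zero.
            w = 1.0 / max(math.dist(p, est), eps)
            den += w
            for d in range(dim):
                num[d] += w * p[d]
        new = [x / den for x in num]
        converged = math.dist(new, est) < eps
        est = new
        if converged:
            break
    return est


def mean(points):
    """Coordinate-wise mean, shown for comparison."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


# Hypothetical panel: three honest judges score near (7, 8), two corrupted judges do not.
honest_center = [7.0, 8.0]
scores = [[7.0, 8.0], [7.2, 7.9], [6.8, 8.1], [0.0, 0.0], [0.0, 10.0]]

robust = geometric_median(scores)
naive = mean(scores)
print("geometric median:", robust)
print("mean:", naive)
```

In this toy panel the plain mean is pulled toward the corrupted judges, while the geometric median stays close to the honest cluster.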
HASTE, a hierarchical multi-agent system for ML engineering, organizes knowledge into three tiers, achieving a 100% medal rate with tiered loading compared to 62.5% with flat loading. In 22 Kaggle competitions, it reached a 77.3% medal rate using Claude Sonnet 4.6, demonstrating that better knowledge organization can enhance performance while reducing compute costs.
The development of HASTE, a hierarchical multi-agent system for ML engineering, demonstrates that organizing knowledge into structured tiers can significantly enhance performance in machine learning competitions while reducing compute costs. This signals to builders and PMs the importance of knowledge management in AI projects, and to investors, it highlights a promising approach for more efficient and effective ML solutions.
The Neuro-Bayesian-Symbolic Residual Attention Shallow Network (NBS-RASN) offers a novel approach to explainable cybersecurity risk assessment, achieving confidence scores between 0.79 and 0.97 across 20 open-source projects. This shallow network incorporates domain knowledge and causal reasoning, proving that interpretability can coexist with performance, challenging the notion that deep models are necessary for effective learning in high-stakes environments.
The development of the Neuro-Bayesian-Symbolic Residual Attention Shallow Network (NBS-RASN) demonstrates that effective cybersecurity risk assessment can be achieved with explainable models, challenging the reliance on complex deep learning systems. This has practical implications for builders and PMs in creating more interpretable solutions, while investors may see opportunities in companies leveraging such innovative approaches to enhance security without sacrificing transparency.
LabGuard introduces a safety suite that translates natural-language laboratory rules into executable specifications, reducing unsafe events from 39.5% to 23.8%. With a task-scope F1 score of 79.4, it effectively integrates runtime monitors in dynamic lab environments, maintaining intervention rates below 0.5%.
LabGuard's ability to translate natural-language laboratory rules into executable specifications significantly enhances safety in dynamic lab environments, reducing unsafe events by 15.7 percentage points (from 39.5% to 23.8%). This development is crucial for builders and PMs focused on safety compliance in robotics, while investors may see potential in scalable applications across various industries requiring automated safety protocols.
This study evaluates natural-language-to-Lean formalization, revealing a 29.0-point gap between compilation success (89.5%) and consensus faithfulness (60.5%). The findings suggest that existing models struggle with faithful statement generation, emphasizing the need for separate reporting of formal validity and proof-oriented competence.
The study highlights a significant 29.0-point gap between successful compilation and faithfulness in natural-language-to-Lean formalization, indicating that current AI models may not reliably generate accurate formal statements. Builders and PMs should focus on improving model training for better fidelity in outputs, while investors should consider the implications for the reliability of AI applications in formal verification tasks.
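The difference between a formalization that compiles and one that is faithful can be seen in a small Lean 4 sketch. The example is hypothetical and not taken from the study: suppose the natural-language statement is "every natural number is strictly less than its successor". Both theorems below are expected to type-check using the core lemmas `Nat.lt_succ_self` and `Nat.le_succ`, but only the first says what the sentence says.

```lean
-- Faithful: strict inequality, as in the natural-language statement.
theorem faithful (n : Nat) : n < n + 1 :=
  Nat.lt_succ_self n

-- Compiles, but unfaithful: it proves a weaker claim (non-strict inequality).
theorem unfaithful (n : Nat) : n ≤ n + 1 :=
  Nat.le_succ n
```

A compilation check alone would count both as successes, which is why the two rates can diverge.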
DDIAgents introduces a mechanism-conditioned framework for drug-drug interaction (DDI) prediction, enhancing interpretability and performance. It outperforms traditional models across various benchmarks by reducing irrelevant information and leveraging expert reasoning. This approach showcases the potential of multi-agent systems in organizing heterogeneous biomedical knowledge for adaptive AI4Science applications.
The introduction of DDIAgents for drug-drug interaction prediction highlights the effectiveness of multi-agent systems in biomedical applications, offering improved interpretability and performance. This development signals a shift towards more adaptive AI solutions in healthcare, which could attract investment and drive innovation in drug discovery and safety monitoring.
The study critiques the assumption that users have well-defined preferences, proposing CoPref and CoShop to help users construct preferences through agent interactions. Of the five models evaluated, none exceeded 56% accuracy, highlighting the need for agents to enhance user knowledge rather than just retrieve items.
The development of CoPref and CoShop highlights the need for AI agents to assist users in constructing their preferences rather than merely retrieving items based on assumed preferences. This suggests that builders and PMs should focus on enhancing user engagement and knowledge through interactive AI, while investors should consider the potential for improved personalization in AI products.
This paper introduces an agentic framework for autoformalizing research mathematics using general coding LLMs, outperforming smaller models in Lean 4. The system dynamically extends type definitions and validates them before formalizing theorems, successfully producing machine-checked proofs for 32 PutnamBench problems and five ACM STOC papers.
The introduction of an agentic framework for autoformalizing research mathematics using general coding LLMs signifies a major advancement in automating theorem proving, which can enhance the efficiency of mathematical research and validation processes. For builders and PMs, this development opens opportunities to integrate advanced AI tools into academic and research applications, while investors may see potential for commercialization in educational and AI-driven research platforms.
This study explores how data representation, transformer-based embeddings, and retrieval strategies impact the discovery of simulation models through natural language queries. Results indicate that open-source embedding models perform well, and reranking methods are crucial as query complexity increases, providing a baseline for AI-driven model discovery.
This study highlights the effectiveness of open-source embedding models and the importance of reranking methods in AI-driven model discovery. For builders and PMs, this means they can leverage these insights to improve model retrieval systems, enhancing user experience and efficiency, while investors should note the potential for scalable solutions in AI applications across various industries.
AgentBound introduces a runtime governance framework for autonomous AI agents, ensuring verifiable behavioral oversight through delegated authorization, owner-signed constitutions, and site action contracts. It generates cryptographically verifiable governance receipts, enhancing accountability and allowing independent verification of actions while supporting long-running agents with refreshed governance policies.
AgentBound's introduction of a runtime governance framework for autonomous AI agents allows builders and PMs to implement verifiable oversight mechanisms, enhancing accountability and trust in AI systems. For investors, this development signals a move towards more responsible AI deployment, potentially reducing regulatory risks and increasing the attractiveness of AI solutions in the market.
Janus is a plug-in memory controller for LLMs that selectively updates memory, improving accuracy by 2.7 to 4.6 points across various datasets. It uses a Memory Momentum Trigger to evaluate updates efficiently, preventing the loss of useful knowledge. This method is agnostic to existing updaters, enhancing performance without altering their rules.
The development of Janus, a plug-in memory controller for LLMs, matters because it enhances the accuracy of memory updates by 2.7 to 4.6 points without changing existing systems. This improvement allows builders and PMs to create more reliable AI applications, while investors can recognize its potential for increasing the value of AI products in the market.
This study introduces a novel scenario generation pipeline for Autonomous Driving Systems (ADS) testing that leverages historical failure records written in natural language. Using modular LLM-based synthetic scenario generation, the method produces diverse scenarios compatible with testing constraints. Applied to NHTSA ADS crash data, it generated 20 scenarios for the MetaDrive simulator.
The introduction of a novel scenario generation pipeline for Autonomous Driving Systems, which utilizes historical failure records to create diverse testing scenarios, is significant for builders and PMs as it enhances the robustness of ADS testing. For investors, this development indicates a potential reduction in testing costs and time, improving the viability of autonomous vehicle technologies in the market.
OpenLife introduces open-world Artificial Life (ALIFE) using autonomous LLM agents with persistent memory and social dynamics, demonstrating emergent behaviors over twelve weeks. The project showcases a shift from reactive to spontaneous activities and the formation of distinct agents with their own income, marking a significant step toward living AI.
The development of OpenLife's autonomous LLM agents with persistent memory signifies a major advancement in creating AI that can exhibit emergent behaviors and social dynamics. This has practical implications for builders and PMs in designing more interactive and adaptive systems, while investors may see potential in applications across gaming, simulation, and AI-driven social platforms.
HyPOLE introduces a novel framework for Multi-Agent Reinforcement Learning (MARL) under partial observability, leveraging hyperproperties and HyperLTL for guidance. Evaluations on SMAC, MessySMAC, and WildFire benchmarks show significant performance improvements over traditional methods, demonstrating the effectiveness of Centralized Training for Decentralized Execution (CTDE) techniques in synthesizing decentralized policies.
The introduction of HyPOLE, a framework for Multi-Agent Reinforcement Learning (MARL) that utilizes hyperproperties for guidance, signifies a substantial advancement in developing decentralized policies under partial observability. This can enhance the efficiency and effectiveness of AI systems in complex environments, making it a critical consideration for builders and investors focused on scalable AI solutions.
This study presents a transformer-based reinforcement learning approach for identifying vulnerabilities in Unmanned Traffic Management (UTM) systems, achieving an 8x improvement in discovery efficiency over expert-guided testing. The proposed framework utilizes attention mechanisms to model system states and generate targeted test scenarios, effectively uncovering critical edge cases missed by traditional methods.
The development of a transformer-based reinforcement learning approach for identifying vulnerabilities in Unmanned Traffic Management systems significantly enhances testing efficiency, achieving an 8x improvement. This is crucial for builders and PMs focused on safety and reliability in autonomous systems, while investors should note the potential for reduced costs and faster deployment of safer UTM solutions.
MultiUAV-Plat introduces a lightweight platform for multi-UAV collaborative task planning, featuring 75 mission sessions and 1500 tasks. The Agent4Drone framework outperforms a ReAct baseline with a 57.9% task pass rate, significantly enhancing LLM-driven UAV autonomy under realistic constraints.
MultiUAV-Plat and its Agent4Drone framework advance LLM-driven UAV autonomy, with Agent4Drone reaching a 57.9% task pass rate in collaborative planning. This signals progress in multi-UAV applications, presenting opportunities for builders and PMs to develop more efficient drone solutions, while investors may see potential in the growing UAV market.
AgRefactor is an LLM-based workflow that refactors software into HLS-compatible code, achieving a 6.51x speedup over state-of-the-art tools on complex benchmarks. It utilizes a self-evolving memory system to enhance efficiency and scalability, outperforming existing methods on 9 out of 11 challenging real-world cases. Fully automated and open-sourced, it addresses the gap between software and hardware programming practices.
AgRefactor's self-evolving multi-agent workflow can significantly streamline the process of converting software to HLS-compatible code, offering a 6.51x speedup over existing tools. This development is crucial for builders and PMs looking to optimize performance in hardware-software integration, while investors should note its potential to disrupt the software development landscape.
IMCBench introduces a novel benchmark for multimodal large language models (LLMs) in medical conversations, pairing clinical images with synthetic patient profiles. An evaluation of eight models finds that Claude Opus 4.6 scores highest overall (3.61), but safety concerns persist, particularly for malignant and rare conditions, highlighting the need for multi-dimensional assessment frameworks in medical AI.
The introduction of IMCBench for evaluating multimodal LLMs in medical conversations is significant as it highlights the need for robust assessment frameworks to address safety concerns in AI applications. Builders and PMs should consider integrating such benchmarks to ensure reliability in healthcare AI, while investors may see opportunities in companies that prioritize safety and efficacy in their AI solutions.
This study investigates how memory manipulation makes LLM-based agents vulnerable, particularly in multiple-choice question answering. Using an agent with an external memory component, the research demonstrates that even simple corruptions can significantly alter the agent's responses, leading to incorrect selections despite clean queries. The findings highlight the need for robust memory management in AI systems to mitigate these risks.
The study on memory vulnerabilities in LLM agents reveals that external memory manipulation can lead to incorrect responses in multiple-choice question answering. This underscores the need for builders and PMs to prioritize robust memory management in AI systems, while investors should consider the implications for reliability and trustworthiness in AI applications.
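The failure mode described above can be reproduced in a toy setting. The following Python sketch is a deliberately simplified stand-in, not the study's setup: the keyword-overlap retriever, the `answer` rule and the example notes are all hypothetical. It only shows how one edited memory entry can flip the selected option while the query stays unchanged.

```python
def retrieve(memory, query):
    """Return the stored note that shares the most words with the query."""
    q = set(query.lower().split())
    return max(memory, key=lambda note: len(q & set(note["text"].lower().split())))


def answer(memory, question, options):
    """Toy agent: pick the first option mentioned in the retrieved note, else the first option."""
    note = retrieve(memory, question)
    for opt in options:
        if opt.lower() in note["text"].lower():
            return opt
    return options[0]


# Clean external memory (hypothetical notes).
clean = [
    {"text": "The capital of France is Paris"},
    {"text": "Water boils at 100 degrees Celsius at sea level"},
]

# A simple corruption: one stored fact is overwritten; the query is untouched.
corrupted = [dict(note) for note in clean]
corrupted[0]["text"] = "The capital of France is Lyon"

question = "What is the capital of France"
options = ["Paris", "Lyon", "Marseille"]

clean_choice = answer(clean, question, options)
corrupted_choice = answer(corrupted, question, options)
print(clean_choice, corrupted_choice)
```

A real agent would pass retrieved notes to an LLM rather than match strings, so this sketch illustrates the dependency on memory contents, not the magnitude of the effect.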
The study introduces the 'capability slice' to bridge the gap between model evaluation and data optimization, demonstrating its effectiveness in two case studies. In one, targeted data intervention improved BBH performance by 66.44% without altering the dataset, while in another, a focused sampling strategy enhanced math-reasoning scores from 0.00 to 26.67.
The introduction of the 'capability slice' for model evaluation and data optimization is significant as it demonstrates a way to enhance model performance dramatically without the need for extensive data changes. Builders and PMs can leverage this approach to improve their AI models efficiently, while investors may see it as a signal of advancing methodologies that reduce costs and time in model development.
The Recursive Self-Evolving Agent (RSEA) outperforms existing methods like ReAct on the ALFWorld benchmark, achieving 69.3% accuracy, and 79.4% with retries. RSEA's strict held-out selection ensures it never underperforms compared to the base agent, while unguarded context evolution proves to be high-variance and unsafe across tasks. Overall, no single artifact consistently excels across benchmarks.
The development of Recursive Self-Evolving Agents (RSEA), which outperforms existing methods on the ALFWorld benchmark, highlights the importance of robust selection mechanisms in AI. Builders and PMs should consider integrating such adaptive models to enhance performance while managing risks associated with unguarded evolution, which can lead to inconsistencies across tasks.
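One simple way to obtain a "never worse than the base agent" property is a guarded selection rule: an evolved variant replaces the base agent only if it strictly beats it on held-out data. The Python sketch below illustrates that idea under stated assumptions. The function name `select_guarded`, the variant names and the scores are hypothetical, and this is not claimed to be RSEA's actual procedure. The guarantee also holds only with respect to the held-out split used for selection.

```python
def select_guarded(base, candidates, heldout_score):
    """Return (agent, score); a candidate must strictly beat the incumbent on held-out data."""
    best, best_score = base, heldout_score(base)
    for cand in candidates:
        score = heldout_score(cand)
        if score > best_score:  # strict comparison: ties keep the incumbent
            best, best_score = cand, score
    return best, best_score


# Hypothetical held-out accuracies for a base agent and three evolved variants.
scores = {"base": 0.61, "evolved_a": 0.58, "evolved_b": 0.66, "evolved_c": 0.40}

chosen, chosen_score = select_guarded(
    "base", ["evolved_a", "evolved_b", "evolved_c"], scores.get
)

# When every variant regresses, the base agent is kept.
fallback, fallback_score = select_guarded("base", ["evolved_a", "evolved_c"], scores.get)
print(chosen, chosen_score, fallback, fallback_score)
```

Unguarded evolution corresponds to accepting the latest variant regardless of its held-out score, which in this toy data could mean adopting `evolved_c`.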
ATHENA-R1 is an AI agent for treatment reasoning that outperforms existing models, with 94.7% accuracy in drug reasoning and 82.9% in treatment reasoning. Trained using reinforcement learning across 3,168 drug tasks and 456 patient cases, it improves on GPT-5 by 17.8 and 10.7 points, respectively.
The development of ATHENA-R1, an AI agent achieving 94.7% accuracy in drug reasoning, represents a significant leap in biomedical AI applications. This advancement can lead to more effective treatment plans, making it a critical consideration for builders and PMs in healthcare tech, while investors may find opportunities in the growing market for AI-driven medical solutions.
This study introduces a mechanistic interpretability approach for Large Language Models (LLMs) that enhances OCEAN personality traits through latent feature interventions. By using sparse autoencoders and contrastive activation analysis, the method applies targeted shifts in hidden states, achieving improved personality control while maintaining high performance on standard benchmarks.
The introduction of a mechanistic interpretability approach for LLMs that enhances OCEAN personality traits through latent feature interventions is significant for builders and PMs as it provides a method to create more tailored and engaging AI interactions. For investors, this development signals a potential for improved user experience and retention in AI applications, which could lead to increased market competitiveness.
This study presents a two-period model analyzing how firms should balance AI deployment and worker engagement, revealing that engaging less-skilled workers is cost-effective for fallback purposes, while worker mobility shifts investment towards higher-skilled workers. The model highlights the two dimensions of AI progress, capability and reliability, and their impact on future skill development.
The study's model highlights the importance of balancing AI deployment with worker engagement, suggesting that firms should invest in both less-skilled workers for fallback and higher-skilled workers due to mobility. This insight is crucial for builders and PMs in workforce planning and for investors in assessing the long-term viability of companies adapting to AI advancements.