DeepSignal tracks AI news from research labs, model companies, developer tools, AI infrastructure, robotics and policy sources. This page updates daily with curated AI signals.

Latest

All recent AI updates, continuously refreshed.

arXiv cs.CL·Avisha Das, Mihir Parmar, Mohana Ramnath, Pulkit Verma

12h ago

Featured · Original

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

AI Summary

The Indi-RomCoM benchmark evaluates LLMs on Romanized Code-Mixed instructions, revealing significant performance drops, especially as code-mixing density increases. LLMs, including proprietary and open-weight models, consistently struggle with RCM tasks, highlighting the need for improved multilingual systems.

Why Featured

The Indi-RomCoM benchmark reveals that current LLMs struggle with code-mixed instructions, indicating a significant gap in their multilingual capabilities. Builders and PMs should focus on developing more robust models for diverse language interactions, while investors may see opportunities in startups addressing this unmet need in the AI language space.

#LLM #Open Source #AI Assistant

arXiv cs.AI·Ramin Pishehvar

12h ago

Featured · Original

A Three-Phase Foundation Model for Tax-Aware Personalized Portfolio Management

AI Summary

This paper introduces a three-phase deep reinforcement learning model for personalized portfolio management, addressing ticker lock-in, monolithic objectives, and static user models. It employs a T5-based time series model for asset encoding, a Mixture of Experts architecture for diverse investment goals, and a personalized inference layer using transaction history, marking a significant advancement in financial AI applications.

Why Featured

The introduction of a three-phase deep reinforcement learning model for personalized portfolio management represents a significant advancement in financial AI, allowing for more tailored investment strategies that adapt to individual user behaviors and goals. This could lead to improved investment performance and customer satisfaction, making it a critical development for builders and PMs in the fintech space, as well as for investors seeking more effective portfolio management tools.

#LLM #Inference #AI Assistant #Enterprise AI

arXiv cs.CL·Mizanur Rahman, Abeer Badawi, Elahe Rahimi, Laleh Seyyed-Kalantari, Frank Rudzicz, Enamul Hoque, Elham Dolatabadi

12h ago

Featured · Original

Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support

AI Summary

The TheraJudge and TheraAgent framework enhances mental health support by aligning therapeutic responses with human evaluations, achieving an ICC of 0.87-0.95 with clinicians. TheraAgent improves therapeutic quality by +0.43 on a 5-point scale, particularly correcting low-quality responses by +2.45 points, demonstrating the efficacy of human-aligned evaluation in large language models.

Why Featured

The development of the TheraJudge and TheraAgent framework, which aligns therapeutic responses with human evaluations and significantly improves therapeutic quality, indicates a growing trend in AI-driven mental health support. Builders and PMs should consider integrating such frameworks into their products to enhance user experience, while investors may see potential in funding mental health tech that leverages human-aligned AI.

#LLM #Agent #AI Assistant

arXiv cs.CL·Alessandro Morosini, Sarah H. Cen, Andrew Ilyas, Hedi Driss, Aleksander Mądry, Chara Podimata

12h ago

Featured · Original

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

AI Summary

This study introduces a framework using generative AI agents for black-box audits of personalization algorithms, revealing that X's algorithm amplifies toxic content based on user ideology. The deployment of 1,120 agents across 14 personas collected over 200,000 content exposures, demonstrating significant variations in content delivery influenced by demographic signals.

Why Featured

The introduction of a framework using generative AI agents for black-box audits of personalization algorithms is significant for builders and PMs as it highlights the need for transparency in algorithmic decision-making. Investors should note that the ability to identify biases in content delivery can lead to improved user trust and compliance with regulatory standards, impacting future investments in AI-driven platforms.

#Agent #AI Assistant #Policy

arXiv cs.CL·Yangqiaoyu Zhou, Mohammad Alqudah, Kwei-Herng Lai, Aaron Halfaker, Yingqi Xiong, Yaar Harari

12h ago

Featured · Original

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

AI Summary

An automated description optimization pipeline for enterprise AI agents reduced engineering effort from 120 minutes to 3.8 minutes while achieving F1 scores of 79.2%, comparable to manually tuned descriptions. The key improvement driver was a single LLM rewrite utilizing false-positive and false-negative cases, highlighting the importance of addressing skill collisions in overlapping descriptions.

Why Featured

The development of an automated description optimization pipeline that reduces engineering effort from 120 minutes to 3.8 minutes while maintaining high F1 scores demonstrates significant efficiency gains in AI deployment. Builders and PMs can leverage this approach to streamline their workflows, while investors should note the potential for cost savings and improved performance in enterprise AI applications.

#LLM #Agent #Enterprise AI

Google Research

22h ago

Original

Expanding our Heat Resilience data to 50+ global cities

AI Summary

Google Research has expanded its Heat Resilience dataset to over 50 global cities, providing high-resolution rooftop reflectivity data to help urban planners implement cool-roof solutions. This initiative aims to mitigate extreme heat, which causes approximately 500,000 deaths annually, by using AI to analyze satellite imagery for targeted cooling interventions.

Why Featured

Google Research's expansion of its Heat Resilience dataset to over 50 global cities provides builders and PMs with critical data for implementing cool-roof solutions, addressing urban heat challenges. For investors, this initiative signals a growing market for sustainable urban development technologies that can mitigate climate-related risks and improve public health outcomes.

#AI Image #Policy

Google Research

1d ago

Featured · Original

Introducing TabFM: A zero-shot foundation model for tabular data

AI Summary

Google Research introduces TabFM, a zero-shot foundation model for tabular data, eliminating manual training and hyperparameter tuning. TabFM leverages in-context learning to generate predictions on unseen tables efficiently, outperforming traditional models in benchmarks across 38 classification and 13 regression datasets.

Why Featured

Google Research's introduction of TabFM, a zero-shot foundation model for tabular data, significantly reduces the need for manual training and hyperparameter tuning, enabling builders and PMs to deploy predictive models faster and at lower costs. This advancement could attract investor interest due to its potential to streamline data-driven decision-making across various industries.

#LLM #Open Source #AI Startup

Google Research

4d ago

Featured · Original

Accelerating Gemini Nano models on Pixel with frozen Multi-Token Prediction

AI Summary

Google Research has accelerated the Gemini Nano models on Pixel devices by implementing frozen Multi-Token Prediction, significantly enhancing performance. This advancement allows for faster processing and improved efficiency in AI tasks, benefiting developers and users of Pixel devices. The new approach aims to reduce computational costs while maintaining high accuracy in predictions.

Why Featured

Google Research's acceleration of Gemini Nano models on Pixel devices through frozen Multi-Token Prediction enhances processing speed and efficiency, which is crucial for builders and PMs focusing on mobile AI applications. This development signals a reduction in computational costs while maintaining accuracy, making it a compelling opportunity for investors in the AI and mobile tech sectors.

#LLM #AI Coding #Inference #AI Assistant

arXiv cs.CL·Ram Janarthan, Coleman Haley, Sharon Goldwater

12h ago

Featured · Original

When transformers learn "impossible" languages, what do they learn?

AI Summary

Recent research on GPT-2 models reveals that while they show gradual degradation in grammatical sensitivity to 'impossible' languages, they significantly struggle with generative tasks, producing fewer high-quality sentences as length increases. This suggests a link between model behavior and the non-attestation of such languages due to generative deficiencies.

Why Featured

The research on GPT-2 models highlights that their performance degrades with complex, non-attested languages, indicating that generative AI may struggle with tasks requiring high grammatical sensitivity. This has implications for builders and PMs in developing more robust language models and for investors in assessing the viability of AI products in linguistically diverse applications.

#LLM #AI Assistant

arXiv cs.CL·Mattia Tritto, Giuseppe Farano, Dario Di Palma, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, Tommaso Di Noia

12h ago

Original

Test-Time Verification for Text-to-SQL via Outcome Reward Models

AI Summary

This study introduces GradeSQL, a framework utilizing Outcome Reward Models (ORMs) for test-time verification in Text-to-SQL tasks, outperforming traditional methods like Best-of-N sampling and Majority Voting by up to 4.33% on the BIRD benchmark. ORMs enhance semantic scoring for structured query generation, demonstrating scalability and effectiveness, especially for complex queries.

Why Featured

The introduction of GradeSQL, which employs Outcome Reward Models for test-time verification in Text-to-SQL tasks, signifies a notable advancement in query generation accuracy, improving performance by up to 4.33% on the BIRD benchmark. This development is crucial for builders and PMs focusing on database interaction tools, as it enhances the reliability of AI-driven query systems, potentially leading to better user experiences and reduced error rates.

#LLM #AI Coding

arXiv cs.CL·Aaron Bundi Anampiu

12h ago

Featured · Original

Multilingual Polarization Detection Using Transformer-Based Models with Class Weighting and Threshold Tuning

AI Summary

This study presents a transformer-based approach for multilingual polarization detection, achieving F1 macro scores of 0.7901 for English and 0.7910 for Swahili in binary detection. The method employs class-weighted loss functions and threshold tuning to address label imbalance, demonstrating competitive performance in the SemEval-2026 Task 9 leaderboard.
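The two imbalance techniques named in the summary can be illustrated with a small, self-contained sketch. This is not the paper's code: the inverse-frequency weighting formula, the threshold grid, and the toy labels and probabilities are all assumptions chosen for illustration.

```python
def class_weights(labels):
    """Inverse-frequency weights: n / (num_classes * class_count)."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for binary labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


def predict(probs, threshold):
    """Label an example positive when its probability reaches the threshold."""
    return [int(p >= threshold) for p in probs]


def tune_threshold(y_true, probs):
    """Pick the grid threshold that maximizes macro F1 on held-out data."""
    grid = [i / 100 for i in range(5, 96)]
    return max(grid, key=lambda t: macro_f1(y_true, predict(probs, t)))


# Toy validation set (hypothetical): 6 negatives, 2 positives, and a model
# whose positive-class probabilities sit below the default 0.5 cut-off.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45]

weights = class_weights(y_true)          # the minority class gets the larger weight
best_t = tune_threshold(y_true, probs)
default_score = macro_f1(y_true, predict(probs, 0.5))
tuned_score = macro_f1(y_true, predict(probs, best_t))
```

In a real training loop the weights would typically be passed to the loss function, and the threshold would be tuned on a validation split rather than on the test data.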

Why Featured

The development of a transformer-based model for multilingual polarization detection with high F1 scores indicates a significant advancement in natural language processing capabilities. This can help builders and PMs create more effective sentiment analysis tools for diverse languages, while investors may see opportunities in products that leverage this technology for content moderation and social media analytics.

#LLM #AI Search #AI Assistant

arXiv cs.CL·Naihao Deng, Yilun Zhu, Joan Nwatu, Clayton Scott, Rada Mihalcea

12h ago

Featured · Original

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

AI Summary

This paper identifies deductive stereotyping in large language models (LLMs), where models make biased inferences based on population statistics. To counteract this, the authors propose Fair-GCG, a framework that enhances fairness-aware reasoning by discovering effective injection phrases, leading to improved performance on fairness benchmarks and real-world tasks.

Why Featured

The development of Fair-GCG addresses deductive stereotyping in LLMs, which can lead to biased outputs in applications. For builders and PMs, adopting this framework can enhance the fairness of their models, improving user trust and compliance with ethical standards, while investors should note its potential to create more responsible AI products that meet growing regulatory demands.

#LLM #Inference #AI Assistant #Policy

arXiv cs.CL·Jingyu Zhang, Xinyi Yan, Yi Xiang, Yingyi Zhang, Chengzhi Zhang

12h ago

Original

Building a Multimodal Dataset of Academic Paper for Keyword Extraction

AI Summary

This study presents a multimodal dataset of 1000 academic papers for keyword extraction, incorporating text, images, and audio. Experiments reveal that combining these modalities significantly enhances keyword extraction performance, highlighting the importance of diverse data sources in model training.

Why Featured

The development of a multimodal dataset for keyword extraction from academic papers demonstrates the potential for improved model performance through diverse data sources. Builders and PMs should consider integrating multimodal approaches in their AI projects to enhance functionality, while investors may see opportunities in startups leveraging such innovative datasets for better research tools.

#AI Video #AI Image #AI Search

arXiv cs.CL·Kazuaki Furumai, Shuichiro Haruta, Kazunori Matsumoto, Daisuke Kamisaka

12h ago

Featured · Original

CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations

AI Summary

CORTEX is a token-level hallucination detection method for Retrieval-Augmented Generation (RAG) that improves localization of ungrounded content by comparing internal representations of LLMs with and without retrieved documents. Experiments on two RAG benchmarks demonstrate substantial performance gains in detecting hallucinations, reducing false positives and enhancing span consistency.

Why Featured

The development of CORTEX, a token-level hallucination detection method for Retrieval-Augmented Generation, significantly enhances the reliability of AI-generated content by reducing false positives and improving span consistency. This is crucial for builders and PMs focused on deploying trustworthy AI systems, while investors should note its potential to increase user trust and engagement in AI applications.

#LLM #AI Coding #Inference

arXiv cs.AI·Bartłomiej Cupiał, Jan Łojek, Mikołaj Garstecki, Szymon Pobłocki, Alicja Ziarko, Piotr Miłoś

12h ago

Original

What Drives Interactive Improvement from Feedback?

AI Summary

The study reveals that multi-turn language agents show limited improvement from self-generated feedback compared to strong external feedback, emphasizing the importance of the student's ability to act on feedback. The controlled evaluation on benchmarks such as Omni-MATH and Codeforces indicates that feedback must provide specific guidance to enhance performance effectively.

Why Featured

The study highlights that multi-turn language agents benefit more from strong external feedback than from self-generated feedback, indicating that builders and PMs should prioritize developing systems that can provide specific, actionable guidance. For investors, this suggests that products focused on enhancing feedback mechanisms may have a competitive edge in improving AI performance.

#LLM #Agent #AI Assistant

arXiv cs.AI·Zhe Dong (University of Maine at Presque Isle), Fang Qin (Stanford University), Manish Shah (Independent Researcher)

12h ago

Featured · Original

When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models

AI Summary

LearnStop, a checkpoint stopper for reasoning models, shows task-dependent benefits in early exits. In free-form math tasks like GSM8K with Qwen3-32B, it achieves a +0.157 peak adapt gain, outperforming scalar exits, while scalar rules remain competitive in multiple-choice settings.

Why Featured

The introduction of LearnStop, a checkpoint stopper for reasoning models, highlights the importance of task-dependent strategies in AI performance. Builders and PMs should consider integrating such adaptive mechanisms to optimize model efficiency and effectiveness, while investors may find opportunities in technologies that enhance AI reasoning capabilities, leading to better outcomes in diverse applications.

#LLM #AI Coding #Inference

arXiv cs.AI·Ankur Samanta, Akshayaa Magesh, Tal Lancewicki, Ayush Jain, Youliang Yu, Paul Sajda, Kaveh Hassani, Aditya Modi, Daniel R. Jiang, Yonathan Efroni

12h ago

Featured · Original

BayesBench: Evaluating LLM Belief Trajectories Under Multi-Turn Evidence Accumulation

AI Summary

BayesBench evaluates LLMs' belief updates in multi-turn conversations, revealing that while scaling improves latent inference, it doesn't consistently enhance downstream predictions. The study assesses seven LLMs (3B-70B) across Bayesian tasks, highlighting a gap between inferring latent structures and rational belief updates.

Why Featured

The development of BayesBench provides a framework for evaluating how large language models (LLMs) update their beliefs during multi-turn conversations, highlighting the limitations of scaling in improving predictive accuracy. This insight is crucial for builders and PMs to refine LLM applications, ensuring they can effectively manage user interactions and expectations in real-world scenarios.

#LLM #Inference

arXiv cs.CL·Ashish Hallur, Thomas Thebaud, Georgi Tinchev, Venkatesh Ravichandran, Laureano Moro-Velazquez

12h ago

Featured · Original

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

AI Summary

The study introduces a percentile-based evaluation protocol for speech-to-speech AI agents, using over 4000 hours of conversation data to assess prosody and rhythm. This method improves the calibration of evaluation metrics like $F_0$ expressivity and speech rate, yielding more interpretable results compared to pooled human statistics.

Why Featured

The introduction of a percentile-based evaluation protocol for speech-to-speech AI agents enhances the assessment of prosody and rhythm, allowing builders and PMs to create more natural and engaging dialogue systems. For investors, this advancement signals a potential increase in user satisfaction and market competitiveness in the AI conversational space.

#Agent #Inference #AI Assistant

arXiv cs.CL·Tanvir Ahmed Sijan, S. M Golam Rifat, Nayeemul Islam, Md. Musfique Anwar

12h ago

Featured · Original

Beyond Clean Text: Evaluating Encoder and Decoder Robustness for Bangla Event Detection in Noisy Text

AI Summary

This study evaluates the robustness of Bangla event detection systems using a benchmark of 9,979 sentences across clean and noisy text. Encoder models like BanglaBERT excel in clean conditions but falter under noise, while decoder-only models like Llama 3 show greater resilience, especially with corrupted event triggers. Combining training on clean and noisy data improves encoder performance and narrows the robustness gap.

Why Featured

The study highlights the varying robustness of Bangla event detection models, specifically showing that while encoder models like BanglaBERT perform well in clean text, decoder models like Llama 3 are more resilient to noise. This insight is crucial for builders and PMs developing applications in noisy environments, as it suggests that integrating both model types could enhance overall system reliability.

#LLM #AI Assistant

arXiv cs.AI·Cor Steging, Ludi van Leeuwen, Tadeusz Zbiegień

12h ago

Featured · Original

Investigating Multi-Agent Deliberation in Law

AI Summary

This study explores multi-agent deliberation methods for legal reasoning using Large Language Models (LLMs), revealing that these frameworks can outperform traditional models in specific scenarios. The introduced frameworks, inspired by courtroom procedures, demonstrate comparable performance to baseline LLMs while addressing unique legal cases. The findings suggest that multi-agent systems could significantly enhance AI applications in the legal domain.

Why Featured

The study on multi-agent deliberation in legal reasoning demonstrates that these frameworks can outperform traditional models in specific scenarios, indicating a potential shift in how AI can be applied in the legal domain. Builders and PMs should consider integrating multi-agent systems into their legal tech solutions, while investors may see opportunities in startups leveraging this advanced approach to enhance legal decision-making processes.

#LLM #Agent #AI Assistant #Policy

arXiv cs.AI·Anish Acharya, Kris W Pan, Brian Verkhovsky

12h ago

Featured · Original

RoPoLL: Robust Panel of LLM Judges

AI Summary

RoPoLL, a robust panel of LLM judges, outperforms traditional LLM jury methods by mitigating bias from individual judges, achieving a 19% improvement on cross-dimensional attacks and significantly outperforming Mistral-Large-3 in specific corruption scenarios. It utilizes a geometric median for aggregation, ensuring optimal performance against up to 50% corruption rates.
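To make the aggregation idea concrete, here is a minimal sketch of a geometric median computed with Weiszfeld's iteration. The summary only states that a geometric median is used; the choice of Weiszfeld's algorithm, the function name, and the toy judge scores are assumptions for illustration and are not taken from the paper.

```python
import math


def geometric_median(points, iters=200, eps=1e-9):
    """Approximate the point minimizing the summed Euclidean distance to all points."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    est = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        den = 0.0
        for p in points:
            # Weight each point by inverse distance; clamp to avoid dividing by zero.
            w = 1.0 / max(math.dist(p, est), eps)
            den += w
            for i in range(dim):
                num[i] += w * p[i]
        new = [n / den for n in num]
        if math.dist(new, est) < eps:
            return new
        est = new
    return est


# Hypothetical panel: five judges score one response on two rubric dimensions.
# Three judges roughly agree; two corrupted judges report the minimum score.
scores = [[4.0, 4.0], [4.0, 5.0], [5.0, 4.0], [1.0, 1.0], [1.0, 1.0]]

mean_score = [sum(col) / len(col) for col in zip(*scores)]
median_score = geometric_median(scores)
```

On this toy panel the mean is pulled toward the two corrupted judges, while the geometric median stays with the agreeing majority, which is the robustness property the summary attributes to this style of aggregation.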

Why Featured

RoPoLL, a robust panel of LLM judges, reports a 19% improvement on cross-dimensional attacks by limiting the influence of biased or corrupted individual judges. This advancement provides builders and PMs with a more reliable framework for evaluating AI performance, while investors can recognize the potential for more trustworthy AI applications in critical sectors.

#LLM #AI Assistant

arXiv cs.AI·Yongbin Kim, Yashar Talebirad, Osmar R. Zaiane

12h ago

Featured · Original

Why Solve It Twice? Hierarchical Accumulation of Skills for Transfer-Efficient ML Engineering

AI Summary

HASTE, a hierarchical multi-agent system for ML engineering, organizes knowledge into three tiers, achieving a 100% medal rate with tiered loading compared to 62.5% with flat loading. In 22 Kaggle competitions, it reached a 77.3% medal rate using Claude Sonnet 4.6, demonstrating that better knowledge organization can enhance performance while reducing compute costs.

Why Featured

The development of HASTE, a hierarchical multi-agent system for ML engineering, demonstrates that organizing knowledge into structured tiers can significantly enhance performance in machine learning competitions while reducing compute costs. This signals to builders and PMs the importance of knowledge management in AI projects, and to investors, it highlights a promising approach for more efficient and effective ML solutions.

#Agent #AI Coding #Inference

arXiv cs.CL·M. K. Arabov

12h ago

Featured · Original

Bridging Scientific Heritage: An Arabic-Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

AI Summary

A new Arabic-Russian benchmark for scientific translation includes a hybrid corpus of 27,000 sentence pairs and fine-tunes models like Qwen2.5-7B, achieving BLEU 23.15. This work facilitates knowledge exchange between Arabic and Russian researchers, supporting sustainable partnerships and innovation.

Why Featured

The development of a new Arabic-Russian benchmark for scientific translation, utilizing a hybrid corpus and fine-tuning models like Qwen2.5-7B, enhances cross-linguistic collaboration in research. This enables builders and PMs to create tools that facilitate knowledge transfer, while investors can identify opportunities in emerging markets focused on multilingual AI applications.

#LLM #Open Source #AI Assistant

arXiv cs.AI·Jingpu Yang, Fengxian Ji, Zhengzhao Lai, Zhexuan Cui, Guangxian Ouyang, Qian Jiang, Fan Zhang, Min Peng, Qianqian Xie, Preslav Nakov, Zhuohan Xie

12h ago

Original

LabGuard: Grounding Natural-Language Laboratory Rules into Runtime Guards for Embodied Laboratory Agents

AI Summary

LabGuard introduces a safety suite that translates natural-language laboratory rules into executable specifications, reducing unsafe events from 39.5% to 23.8%. With a task-scope F1 score of 79.4, it effectively integrates runtime monitors in dynamic lab environments, maintaining intervention rates below 0.5%.

Why Featured

LabGuard's ability to translate natural-language laboratory rules into executable specifications significantly enhances safety in dynamic lab environments, reducing unsafe events by 15.7 percentage points (from 39.5% to 23.8%). This development is crucial for builders and PMs focused on safety compliance in robotics, while investors may see potential in scalable applications across various industries requiring automated safety protocols.

#Agent #Robotics #AI Assistant

arXiv cs.AI·Zhenqian Shen, Yu Liu, Xiaoyi Fu, Quanming Yao

12h ago

Featured · Original

DDIAgents: Mechanism-Conditioned Context Flow for Drug-Drug Interaction Prediction

AI Summary

DDIAgents introduces a mechanism-conditioned framework for drug-drug interaction (DDI) prediction, enhancing interpretability and performance. It outperforms traditional models across various benchmarks by reducing irrelevant information and leveraging expert reasoning. This approach showcases the potential of multi-agent systems in organizing heterogeneous biomedical knowledge for adaptive AI4Science applications.

Why Featured

The introduction of DDIAgents for drug-drug interaction prediction highlights the effectiveness of multi-agent systems in biomedical applications, offering improved interpretability and performance. This development signals a shift towards more adaptive AI solutions in healthcare, which could attract investment and drive innovation in drug discovery and safety monitoring.

#Agent #AI Coding #AI Startup
