arXiv cs.CL

https://arxiv.org/list/cs.CL/recent

Latest AI signals from arXiv cs.CL

DeepSignal tracks AI research updates from arXiv cs.CL, filtered for LLMs, agents, benchmarks and NLP infrastructure, and turns research and product signals into plain-English summaries, signal scores and source-linked article pages.

Current topics: Research, LLM, AI Assistant, Inference, AI Coding

High-signal updates

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization (78 signal)
Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG (78 signal)
CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations (78 signal)

arXiv cs.CL·Avisha Das, Mihir Parmar, Mohana Ramnath, Pulkit Verma

10h ago

Featured·Original

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

AI Summary

The Indi-RomCoM benchmark evaluates LLMs on Romanized code-mixed (RCM) instructions, revealing significant performance drops, especially as code-mixing density increases. Both proprietary and open-weight LLMs consistently struggle with RCM tasks, highlighting the need for improved multilingual systems.

Why Featured

The Indi-RomCoM benchmark reveals that current LLMs struggle with code-mixed instructions, indicating a significant gap in their multilingual capabilities. Builders and PMs should focus on developing more robust models for diverse language interactions, while investors may see opportunities in startups addressing this unmet need in the AI language space.

#LLM #Open Source #AI Assistant


arXiv cs.CL·Mizanur Rahman, Abeer Badawi, Elahe Rahimi, Laleh Seyyed-Kalantari, Frank Rudzicz, Enamul Hoque, Elham Dolatabadi

10h ago

Featured·Original

Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support

AI Summary

The TheraJudge and TheraAgent framework enhances mental health support by aligning therapeutic responses with human evaluations, achieving an ICC of 0.87-0.95 with clinicians. TheraAgent improves therapeutic quality by +0.43 on a 5-point scale, particularly correcting low-quality responses by +2.45 points, demonstrating the efficacy of human-aligned evaluation in large language models.

Why Featured

The development of the TheraJudge and TheraAgent framework, which aligns therapeutic responses with human evaluations and significantly improves therapeutic quality, indicates a growing trend in AI-driven mental health support. Builders and PMs should consider integrating such frameworks into their products to enhance user experience, while investors may see potential in funding mental health tech that leverages human-aligned AI.

#LLM #Agent #AI Assistant


arXiv cs.CL·Alessandro Morosini, Sarah H. Cen, Andrew Ilyas, Hedi Driss, Aleksander Mądry, Chara Podimata

10h ago

Featured·Original

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

AI Summary

This study introduces a framework using generative AI agents for black-box audits of personalization algorithms, revealing that X's algorithm amplifies toxic content based on user ideology. The deployment of 1,120 agents across 14 personas collected over 200,000 content exposures, demonstrating significant variations in content delivery influenced by demographic signals.

Why Featured

The introduction of a framework using generative AI agents for black-box audits of personalization algorithms is significant for builders and PMs as it highlights the need for transparency in algorithmic decision-making. Investors should note that the ability to identify biases in content delivery can lead to improved user trust and compliance with regulatory standards, impacting future investments in AI-driven platforms.

#Agent #AI Assistant #Policy


arXiv cs.CL·Zhichao Yang, Caiqi Zhang, Ruihan Yang, Chengzu Li, Nigel Collier, Deqing Yang

10h ago

Featured·Original

When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs

AI Summary

The study introduces ACE, an accuracy-controlled evaluation framework for fair comparison of LLMs, revealing that raw global calibration metrics often misrepresent model performance. It shows that many models favored by these metrics lose their advantage when accuracy is considered, highlighting the need for accuracy-aware evaluation in LLM calibration comparisons.

Why Featured

The introduction of the ACE framework for accuracy-controlled evaluation of LLMs is significant because it reveals that traditional calibration metrics can be misleading. Builders and PMs should adopt this framework to ensure they are selecting models based on true performance, while investors should recognize the importance of accuracy-aware evaluations in assessing the viability of AI models in the market.

#LLM #Policy


arXiv cs.CL·Yangqiaoyu Zhou, Mohammad Alqudah, Kwei-Herng Lai, Aaron Halfaker, Yingqi Xiong, Yaar Harari

10h ago

Featured·Original

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

AI Summary

An automated description optimization pipeline for enterprise AI agents reduced engineering effort from 120 minutes to 3.8 minutes while achieving F1 scores of 79.2%, comparable to manually tuned descriptions. The key improvement driver was a single LLM rewrite utilizing false-positive and false-negative cases, highlighting the importance of addressing skill collisions in overlapping descriptions.

Why Featured

The development of an automated description optimization pipeline that reduces engineering effort from 120 minutes to 3.8 minutes while maintaining high F1 scores demonstrates significant efficiency gains in AI deployment. Builders and PMs can leverage this approach to streamline their workflows, while investors should note the potential for cost savings and improved performance in enterprise AI applications.

#LLM #Agent #Enterprise AI


arXiv cs.CL·Ram Janarthan, Coleman Haley, Sharon Goldwater

10h ago

Featured·Original

When transformers learn "impossible" languages, what do they learn?

AI Summary

Recent research on GPT-2 models reveals that while they show gradual degradation in grammatical sensitivity to 'impossible' languages, they significantly struggle with generative tasks, producing fewer high-quality sentences as length increases. This suggests a link between model behavior and the non-attestation of such languages due to generative deficiencies.

Why Featured

The research on GPT-2 models highlights that their performance degrades with complex, non-attested languages, indicating that generative AI may struggle with tasks requiring high grammatical sensitivity. This has implications for builders and PMs in developing more robust language models and for investors in assessing the viability of AI products in linguistically diverse applications.

#LLM #AI Assistant


arXiv cs.CL·Mattia Tritto, Giuseppe Farano, Dario Di Palma, Gaetano Rossiello, Fedelucio Narducci, Dharmashankar Subramanian, Tommaso Di Noia

10h ago

Original

Test-Time Verification for Text-to-SQL via Outcome Reward Models

AI Summary

This study introduces GradeSQL, a framework utilizing Outcome Reward Models (ORMs) for test-time verification in Text-to-SQL tasks, outperforming traditional methods like Best-of-N sampling and Majority Voting by up to 4.33% on the BIRD benchmark. ORMs enhance semantic scoring for structured query generation, demonstrating scalability and effectiveness, especially for complex queries.
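To make the contrast in the summary concrete, here is a minimal sketch of two ways to pick one query from several sampled candidates: majority voting over execution results, and selection by a reward score. It is not GradeSQL's implementation. The candidate queries, their results, and the `score` values are invented for illustration, and `score` only stands in for whatever an Outcome Reward Model would output.

```python
from collections import Counter

# Hypothetical candidates: each has a SQL string, the result of executing it,
# and a made-up reward score standing in for an ORM's output.
candidates = [
    {"sql": "SELECT COUNT(*) FROM orders", "result": (42,), "score": 0.31},
    {"sql": "SELECT COUNT(id) FROM orders", "result": (42,), "score": 0.28},
    {"sql": "SELECT COUNT(*) FROM orders WHERE status = 'paid'",
     "result": (17,), "score": 0.91},
]

def majority_vote(cands):
    """Return the first candidate whose execution result is the most common."""
    counts = Counter(c["result"] for c in cands)
    top_result, _ = counts.most_common(1)[0]
    return next(c for c in cands if c["result"] == top_result)

def select_by_reward(cands):
    """Return the candidate with the highest reward score."""
    return max(cands, key=lambda c: c["score"])

voted = majority_vote(candidates)
rewarded = select_by_reward(candidates)
```

In this toy data the two strategies disagree: voting follows the result that two candidates share, while reward-based selection follows the single highest-scored candidate. That disagreement is the case where a learned verifier can help or hurt, depending on how well its scores track correctness.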

Why Featured

The introduction of GradeSQL, which employs Outcome Reward Models for test-time verification in Text-to-SQL tasks, signifies a notable advancement in query generation accuracy, improving performance by up to 4.33% on the BIRD benchmark. This development is crucial for builders and PMs focusing on database interaction tools, as it enhances the reliability of AI-driven query systems, potentially leading to better user experiences and reduced error rates.

#LLM #AI Coding


arXiv cs.CL·Aaron Bundi Anampiu

10h ago

Featured·Original

Multilingual Polarization Detection Using Transformer-Based Models with Class Weighting and Threshold Tuning

AI Summary

This study presents a transformer-based approach for multilingual polarization detection, achieving macro-F1 scores of 0.7901 for English and 0.7910 for Swahili in binary detection. The method employs class-weighted loss functions and threshold tuning to address label imbalance, demonstrating competitive performance on the SemEval-2026 Task 9 leaderboard.
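The two imbalance techniques named in the summary can be sketched in a few lines of plain Python. This is an illustrative sketch, not the paper's code: the inverse-frequency weighting formula, the threshold grid, and the toy labels and probabilities are all assumptions.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def tune_threshold(y_true, probs, grid=None):
    """Pick the decision threshold that maximizes macro-F1 on held-out data."""
    grid = grid or [i / 100 for i in range(5, 96, 5)]
    def score(t):
        return macro_f1(y_true, [int(p >= t) for p in probs])
    best = max(grid, key=score)
    return best, score(best)

# Toy, imbalanced validation set (made-up values).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.45]

weights = class_weights(y_true)
best_t, best_f1 = tune_threshold(y_true, probs)
default_f1 = macro_f1(y_true, [int(p >= 0.5) for p in probs])
```

On this toy set the default 0.5 cutoff never predicts the minority class, so macro-F1 is low, while a tuned cutoff separates the classes. The weights would typically be passed to the loss function during training, and the threshold tuned afterwards on validation predictions.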

Why Featured

The development of a transformer-based model for multilingual polarization detection with high F1 scores indicates a significant advancement in natural language processing capabilities. This can help builders and PMs create more effective sentiment analysis tools for diverse languages, while investors may see opportunities in products that leverage this technology for content moderation and social media analytics.

#LLM #AI Search #AI Assistant


arXiv cs.CL·Naihao Deng, Yilun Zhu, Joan Nwatu, Clayton Scott, Rada Mihalcea

10h ago

Featured·Original

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

AI Summary

This paper identifies deductive stereotyping in large language models (LLMs), where models make biased inferences based on population statistics. To counteract this, the authors propose Fair-GCG, a framework that enhances fairness-aware reasoning by discovering effective injection phrases, leading to improved performance on fairness benchmarks and real-world tasks.

Why Featured

The development of Fair-GCG addresses deductive stereotyping in LLMs, which can lead to biased outputs in applications. For builders and PMs, adopting this framework can enhance the fairness of their models, improving user trust and compliance with ethical standards, while investors should note its potential to create more responsible AI products that meet growing regulatory demands.

#LLM #Inference #AI Assistant #Policy


arXiv cs.CL·Christopher W. Karvetski, Sheldon S. Huang, Simas Kučinskas, Nadja Flechner, Jingyu Hu, Philip Tetlock, Ezra Karger

10h ago

Featured·Original

Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments

AI Summary

The study introduces Explanation Quality Markers (EQMs), a set of sixty reasoning patterns evaluated by large language models, which predict forecasting accuracy better than traditional methods. Analyzing over 55,000 forecast-rationale pairs, EQMs outperform pre-LLM text-analysis techniques and provide a scalable way to assess judgment quality in natural-language explanations.

Why Featured

The introduction of Explanation Quality Markers (EQMs) provides a scalable method to assess the judgment quality of natural-language explanations, which can enhance decision-making in AI applications. For builders and PMs, integrating EQMs could improve forecasting models, while investors might see this as a signal of more reliable AI-driven insights in various markets.

#LLM #AI Assistant


arXiv cs.CL·Jingyu Zhang, Xinyi Yan, Yi Xiang, Yingyi Zhang, Chengzhi Zhang

10h ago

Original

Building a Multimodal Dataset of Academic Paper for Keyword Extraction

AI Summary

This study presents a multimodal dataset of 1000 academic papers for keyword extraction, incorporating text, images, and audio. Experiments reveal that combining these modalities significantly enhances keyword extraction performance, highlighting the importance of diverse data sources in model training.

Why Featured

The development of a multimodal dataset for keyword extraction from academic papers demonstrates the potential for improved model performance through diverse data sources. Builders and PMs should consider integrating multimodal approaches in their AI projects to enhance functionality, while investors may see opportunities in startups leveraging such innovative datasets for better research tools.

#AI Video #AI Image #AI Search


arXiv cs.CL·Ziling Chen, Chengzhi Zhang, Heng Zhang, Yi Zhao, Chen Yang, Yang Yang

10h ago

Original

Exploring the relationship between team institutional composition and novelty in academic papers based on fine-grained knowledge entities

AI Summary

This study reveals that mixed academic and industrial teams in natural language processing produce more novel papers than purely industrial teams, emphasizing the importance of institutional composition in academic novelty. The research identifies specific knowledge entity combinations that contribute to paper novelty, providing insights for enhancing collaboration and paper quality.

Why Featured

The study highlights that mixed academic and industrial teams in NLP generate more innovative research outputs, indicating that fostering diverse institutional collaborations can enhance product development and research quality. Builders and PMs should consider structuring teams to include both academic and industry experts to drive novelty in their projects.

#LLM #AI Assistant


arXiv cs.CL·Guangsheng Bao, Lihua Rong, Yanbin Zhao, Xiao Yu, Qiji Zhou, Yue Zhang

10h ago

Featured·Original

Triospect: A Three-Dimensional Framework for Robust Statistical AI-Generated Text Detection Against Diverse Attacks

AI Summary

The Triospect Detection Framework enhances AI-generated text detection by incorporating content and expression perspectives, achieving significant improvements in robustness against 17 attack types. It outperformed strong baselines by 22.3% (AUROC) and 13% (TPR01) on the Humanize-16K dataset, and 9.1% (AUROC) and 22% (TPR01) on the adversarial RAID. This framework sets a new standard for statistical detection methods.

Why Featured

The Triospect Detection Framework significantly enhances the robustness of AI-generated text detection against various attacks, improving performance metrics by up to 22.3%. For builders and PMs, this development indicates a stronger foundation for developing applications that require reliable content verification, while investors should note its potential to address growing concerns around misinformation and content authenticity.

#Security #AI Assistant


arXiv cs.CL·Kazuaki Furumai, Shuichiro Haruta, Kazunori Matsumoto, Daisuke Kamisaka

10h ago

Featured·Original

CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations

AI Summary

CORTEX is a token-level hallucination detection method for Retrieval-Augmented Generation (RAG) that improves localization of ungrounded content by comparing internal representations of LLMs with and without retrieved documents. Experiments on two RAG benchmarks demonstrate substantial performance gains in detecting hallucinations, reducing false positives and enhancing span consistency.

Why Featured

The development of CORTEX, a token-level hallucination detection method for Retrieval-Augmented Generation, significantly enhances the reliability of AI-generated content by reducing false positives and improving span consistency. This is crucial for builders and PMs focused on deploying trustworthy AI systems, while investors should note its potential to increase user trust and engagement in AI applications.

#LLM #AI Coding #Inference


arXiv cs.CL·Ashish Hallur, Thomas Thebaud, Georgi Tinchev, Venkatesh Ravichandran, Laureano Moro-Velazquez

10h ago

Featured·Original

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

AI Summary

The study introduces a percentile-based evaluation protocol for speech-to-speech AI agents, using over 4000 hours of conversation data to assess prosody and rhythm. This method improves the calibration of evaluation metrics like $F_0$ expressivity and speech rate, yielding more interpretable results compared to pooled human statistics.

Why Featured

The introduction of a percentile-based evaluation protocol for speech-to-speech AI agents enhances the assessment of prosody and rhythm, allowing builders and PMs to create more natural and engaging dialogue systems. For investors, this advancement signals a potential increase in user satisfaction and market competitiveness in the AI conversational space.

#Agent #Inference #AI Assistant


arXiv cs.CL·Tanvir Ahmed Sijan, S. M Golam Rifat, Nayeemul Islam, Md. Musfique Anwar

10h ago

Featured·Original

Beyond Clean Text: Evaluating Encoder and Decoder Robustness for Bangla Event Detection in Noisy Text

AI Summary

This study evaluates the robustness of Bangla event detection systems using a benchmark of 9,979 sentences across clean and noisy text. Encoder models like BanglaBERT excel in clean conditions but falter under noise, while decoder-only models like Llama 3 show greater resilience, especially with corrupted event triggers. Combining training on clean and noisy data improves encoder performance and narrows the robustness gap.

Why Featured

The study highlights the varying robustness of Bangla event detection models, specifically showing that while encoder models like BanglaBERT perform well in clean text, decoder models like Llama 3 are more resilient to noise. This insight is crucial for builders and PMs developing applications in noisy environments, as it suggests that integrating both model types could enhance overall system reliability.

#LLM #AI Assistant


arXiv cs.CL·M. K. Arabov

10h ago

Featured·Original

Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

AI Summary

A new Arabic-Russian benchmark for scientific translation combines a hybrid corpus of 27,000 sentence pairs with fine-tuned models such as Qwen2.5-7B, achieving a BLEU score of 23.15. This work facilitates knowledge exchange between Arabic and Russian researchers, supporting sustainable partnerships and innovation.

Why Featured

The development of a new Arabic-Russian benchmark for scientific translation, utilizing a hybrid corpus and fine-tuning models like Qwen2.5-7B, enhances cross-linguistic collaboration in research. This enables builders and PMs to create tools that facilitate knowledge transfer, while investors can identify opportunities in emerging markets focused on multilingual AI applications.

#LLM #Open Source #AI Assistant


arXiv cs.CL·Daniela Teodorescu, Saif M. Mohammad, Alona Fyshe

10h ago

Original

AI Summary

This study introduces memory-managed long-context attention, which separates fast processing from editable memory slots. A 2.74M-parameter model achieved 595/600 accuracy with minimal supervision, highlighting the need for controlled slot lifecycles and sparse fallback mechanisms in long-context language models.

Why Featured

The introduction of memory-managed long-context attention with editable memory slots allows for improved efficiency and accuracy in language models, as demonstrated by a 2.74M-parameter model achieving 595/600 accuracy. This development signals to builders and PMs the potential for creating more responsive AI applications that can handle complex tasks with controlled memory management, which could attract investor interest in scalable AI solutions.

#LLM #AI Coding



Latest AI signals from arXiv cs.CL

Indi-RomCoM: Code-Mixed Benchmark for Evaluating LLMs on Romanized Indic-English Instructions

Training Therapeutic Judges and Multi-Agent Systems for Human-Aligned Mental Health Support

Using AI Agents to Automate Black-Box Audits of Personalization Algorithms at Scale

When Calibration Rankings Reverse: Accuracy-Controlled Evaluation for Fair Comparison of LLMs

A Single Rewrite Suffices: Empirical Lessons from Production Skill Description Optimization

When transformers learn "impossible" languages, what do they learn?

Test-Time Verification for Text-to-SQL via Outcome Reward Models

Multilingual Polarization Detection Using Transformer-Based Models with Class Weighting and Threshold Tuning

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments

Building a Multimodal Dataset of Academic Paper for Keyword Extraction

Exploring the relationship between team institutional composition and novelty in academic papers based on fine-grained knowledge entities

Triospect: A Three-Dimensional Framework for Robust Statistical AI-Generated Text Detection Against Diverse Attacks

CORTEX: Token-Level Hallucination Detection in RAG via Comparative Internal Representations

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

Beyond Clean Text: Evaluating Encoder and Decoder Robustness for Bangla Event Detection in Noisy Text

Bridging Scientific Heritage: An Arabic--Russian Parallel Corpus and LLM Benchmark for Sustainable Knowledge Transfer

Linguistic Distancing on Social Media: Indicators of Emotion Regulation Across Age Groups

TAG-DLM: Diffusion Language Models for Text-Attributed Graph Learning

What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR

When Reranking Hurts: Uncertainty-Based Gating for Few-Shot Reranking

SeKV: Resolution-Adaptive KV Cache with Hierarchical Semantic Memory for Long-Context LLM Inference

Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies

Legal Domain Adaptation of Modern BERT Models

Labeling Training Data for Entity Matching Using Large Language Models

5ting at SemEval-2026 Task 8: Strong End-to-End Multi-Turn RAG via LLM-Based Reranking and Faithfulness Control

Depth-Staggered Fibonacci Spacing for Sparse Attention: Static Schedules Beat Learned Dilation and Extrapolate Where Dense Attention Fails

AnTenA: Actionable and Explainable Tensor Analysis System with Large Language Models

Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution

Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory
