Dream at SemEval-2026 Task 13: SALSA for Single-Pass Machine-Generated Code Detection

arXiv cs.CL·Ruslan Berdichevsky, Shai Nahum-Gefen, Elad Ben-Zaken

~1 min · 6/25/2026 · en

Quick Answer

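The paper proposes SALSA (Single-pass Autoregressive LLM Structured Classification) for SemEval-2026 Task 13 Subtask A, which frames machine-generated code detection as binary classification. Its best system reaches an OOD F1 of 0.789 on the official leaderboard, substantially outperforming the CodeBERT baseline's 0.305.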

Quick Take

SALSA trains an autoregressive LLM to emit a single-token label, with each class mapped to a dedicated output token, instead of relying on hand-crafted features or decision rules. The approach targets OOD generalization: it combines balanced sampling across languages with parameter-efficient fine-tuning and conservative training (low learning rate, single epoch) to limit overfitting to the training domain.

Key Points

SALSA maps each class to a dedicated output token for structured classification.
The model is designed to emit a single-token label without hand-crafted features.
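Balanced sampling across languages, parameter-efficient fine-tuning, and conservative training (low learning rate, single epoch) are used to improve OOD robustness.
The best system scored OOD F1 = 0.789 on the official leaderboard, against 0.305 for the CodeBERT baseline.
The task emphasizes generalization to unseen programming languages and application domains.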

Paper Resources

Read Paper: arxiv.org | View PDF: arxiv.org

Article Excerpt

From source RSS / original summary

arXiv:2606.25102v1 Announce Type: new Abstract: Large language models have transformed code generation, raising concerns around authorship, assessment integrity, and software trust. SemEval-2026 Task 13 Subtask A operationalizes detection as binary classification over code snippets, with a particular emphasis on out-of-distribution (OOD) generalization across unseen programming languages and application domains.

We propose a SALSA-style formulation, Single-pass Autoregressive LLM Structured Classification, that maps each class to a dedicated output token and trains the model to emit a single-token label in a structured response. Rather than engineering hand-crafted features or decision rules, this formulation delegates the authorship decision to the model.
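As a rough illustration of the single-token formulation described above, the sketch below reads the logits at the label position from one forward pass, keeps only the entries for the dedicated class tokens, and picks the more probable class. The token ids, the label names, and the choice to renormalize over only the class tokens are illustrative assumptions; the summary does not specify how the authors decode the label.

```python
import math

# Hypothetical token ids for the two dedicated label tokens. In practice
# these would be looked up in the model's tokenizer, not hard-coded.
CLASS_TOKEN_IDS = {"human": 101, "machine": 102}


def classify_from_logits(next_token_logits, class_token_ids=CLASS_TOKEN_IDS):
    """Pick a label from the logits at the label position of one forward pass.

    Only the logits of the class tokens are compared (an assumption of this
    sketch); a softmax over them gives per-class probabilities.
    """
    labels = list(class_token_ids)
    scores = [next_token_logits[class_token_ids[name]] for name in labels]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(labels, exps)}
    label = max(probs, key=probs.get)
    return label, probs


# Toy logits over a 200-token vocabulary standing in for real model output.
logits = [0.0] * 200
logits[101] = 1.0
logits[102] = 3.0
label, probs = classify_from_logits(logits)
print(label, probs)
```

Because the decision is read from a single position, classification needs one forward pass and no multi-token generation.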

To improve OOD robustness, we combine balanced sampling across languages with parameter-efficient fine-tuning and conservative training (low learning rate, single epoch) to avoid overfitting to the training domain. Our best system achieves OOD $F_1 = 0.789$ on the official leaderboard, substantially outperforming the CodeBERT baseline ($F_1 = 0.305$).
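One simple way to realize balanced sampling across languages is to draw the same number of training examples from each language. The sketch below is a minimal version under stated assumptions: the `language` field name, the default of downsampling to the smallest language group, and the fixed seed are all hypothetical, and the summary does not say which balancing scheme the authors used.

```python
import random
from collections import defaultdict


def balanced_sample(examples, per_language=None, seed=0):
    """Return a shuffled subset with an equal number of examples per language.

    If per_language is None, every language is downsampled to the size of
    the smallest language group (an assumption of this sketch).
    """
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for ex in examples:
        by_lang[ex["language"]].append(ex)
    if per_language is None:
        per_language = min(len(group) for group in by_lang.values())
    sampled = []
    for lang in sorted(by_lang):  # sorted for a deterministic order
        group = by_lang[lang]
        sampled.extend(rng.sample(group, min(per_language, len(group))))
    rng.shuffle(sampled)
    return sampled


# Toy, made-up dataset skewed toward one language.
data = (
    [{"language": "python", "code": f"p{i}"} for i in range(6)]
    + [{"language": "java", "code": f"j{i}"} for i in range(3)]
    + [{"language": "cpp", "code": f"c{i}"} for i in range(2)]
)
subset = balanced_sample(data)
counts = {
    lang: sum(1 for ex in subset if ex["language"] == lang)
    for lang in ("python", "java", "cpp")
}
print(counts)
```

Downsampling discards data from the larger groups; sampling with replacement from smaller groups is an alternative when discarding examples is too costly.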

Read on arxiv.org


