Want Better Synthetic Data? Steer It: Activation Steering for Low-Resource Language Generation
Quick Answer
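This study investigates activation steering as an alternative to few-shot prompting for generating synthetic data in low-resource languages. Steering on early layers improves the diversity of the generated data and often improves downstream performance.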
Quick Take
Evaluating four open-source LLMs across 11 typologically diverse languages, the authors find that steering on early layers consistently improves the diversity of generated sentiment and topic classification data. It also often yields stronger downstream classifiers than non-steered zero-shot and few-shot prompting, particularly for low-resource languages.
Key Points
- Activation steering improves synthetic data generation for low-resource languages.
- Two strategies: Language Steering for linguistic identity and Quality Steering for well-formedness.
- Evaluated on four open-source LLMs across 11 diverse languages.
- Early-layer steering consistently improves data diversity and often improves downstream model performance.
- Gains on sentiment and topic classification are most pronounced for low-resource languages.
Article Content
From source RSS / original summary
arXiv:2606.18389v1 Announce Type: new
Abstract: Large language models (LLMs) have become an effective tool for synthetic data generation, including for low-resource languages, where generated data can improve downstream task performance. Current best-performing approaches typically rely on few-shot prompting with target-language examples, which increases inference costs and may reduce diversity through lexical anchoring.
In this work, we investigate activation steering as an alternative for low-resource synthetic data generation. We study two steering strategies: Language Steering, which targets the linguistic identity of a language, and Quality Steering, which captures well-formedness by contrasting human-written and backtranslated text representations.
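As a rough illustration of the contrast behind Quality Steering, the sketch below builds a steering vector as the difference between the mean activations of two example sets and adds it to a hidden state. The difference-of-means construction, the helper names (`mean_vector`, `steering_vector`, `apply_steering`), the scaling factor `alpha`, and the toy 3-dimensional vectors are all assumptions made for illustration; the abstract does not specify how the authors compute or apply their vectors.

```python
from statistics import fmean


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*vectors)]


def steering_vector(positive, negative):
    """Difference of means: points from the negative set toward the positive set.

    Assumption: this is one common way to build a steering vector,
    not necessarily the construction used in the paper.
    """
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden_state, vector, alpha=1.0):
    """Add the scaled steering vector to one hidden state at the chosen layer."""
    return [h + alpha * v for h, v in zip(hidden_state, vector)]


# Toy 3-dimensional "activations"; real ones would be read from a model layer.
human_written = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
backtranslated = [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]

quality_vector = steering_vector(human_written, backtranslated)
steered = apply_steering([0.5, 0.5, 0.5], quality_vector, alpha=0.5)
print(quality_vector, steered)
```

In a real setup the activations would come from a specific layer of the model for each input text, and the choice of layer matters: the abstract reports that steering on early layers consistently improved diversity.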
We evaluate these methods across four open-source LLMs, multiple layers, and 11 typologically diverse languages by generating sentiment and topic classification data and finetuning smaller classifiers. Steering is applied in both zero-shot and few-shot prompting settings and compared against non-steered counterparts. Our results show that steering on early layers consistently improves the diversity of generated data while often yielding stronger downstream model performance, particularly for low-resource languages.
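The abstract does not say how diversity is measured. As one possible proxy, the sketch below computes distinct-n, the ratio of unique n-grams to total n-grams in a set of generated texts. The metric choice, the function name `distinct_n`, the whitespace tokenization, and the example sentences are assumptions for illustration and may differ from the authors' evaluation.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams over whitespace-tokenized texts.

    Assumption: whitespace splitting is a simplification and is not
    suitable for languages that do not separate words with spaces.
    """
    total = 0
    unique = set()
    for text in texts:
        tokens = text.split()
        # Slide a window of size n over the tokens.
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


# Hypothetical generated samples: one repetitive set, one varied set.
repetitive = ["the film was good", "the film was good"]
varied = ["the film was good", "a truly moving story"]

print(distinct_n(repetitive), distinct_n(varied))
```

A higher score means less repeated phrasing across samples, which is the kind of effect lexical anchoring from few-shot examples would be expected to reduce.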