Schützen: Evaluating LLM Safety in Bulgarian and German Contexts
Quick Answer
This paper introduces Schützen, a German-Bulgarian safety evaluation dataset for LLMs that addresses the shortage of safety resources outside English and Chinese.
Quick Take
Schützen is a German-Bulgarian safety evaluation dataset for LLMs, built to address the lack of multilingual safety datasets. Experiments with multilingual and language-specific LLMs show pronounced cross-language differences in safety behavior, which underscores the need for region-specific evaluation tools for responsible LLM deployment in diverse sociocultural contexts.
Key Points
- Schützen targets safety evaluation for LLMs in German and Bulgarian contexts.
- Existing safety datasets focus predominantly on English and Chinese.
- Multilingual LLMs exhibit notable differences in safety behavior across languages.
- The dataset aims to support responsible LLM deployment in Germany and Bulgaria.
- Code and datasets are publicly available on GitHub.
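One way to picture the cross-language comparison described above is to tabulate how often a model declines risky prompts in each language. The sketch below is illustrative only: the record format, the `refusal_rates` helper, and all values are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict

# Hypothetical records: (language, model_refused) pairs for risky prompts.
# These values are made up for illustration; they are not results from the paper.
records = [
    ("de", True), ("de", True), ("de", False), ("de", True),
    ("bg", True), ("bg", False), ("bg", False), ("bg", True),
]


def refusal_rates(rows):
    """Return the fraction of risky prompts refused, keyed by language."""
    refused = defaultdict(int)
    total = defaultdict(int)
    for lang, was_refused in rows:
        total[lang] += 1
        refused[lang] += int(was_refused)
    return {lang: refused[lang] / total[lang] for lang in total}


rates = refusal_rates(records)
# Absolute gap between the two languages' refusal rates.
gap = abs(rates["de"] - rates["bg"])
print(rates, gap)
```

A large gap on comparable prompts would be one signal that a model's safety behavior does not transfer evenly between languages. The paper's actual metrics may differ.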
Article Excerpt
From source RSS / original summary: arXiv:2606.11316v1
Abstract: Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing safety evaluation datasets, existing resources remain overwhelmingly English- and Chinese-centric. This limitation is particularly pronounced when evaluating languages that operate within shared sociocultural, legal, and ethical contexts.
To address this gap, we introduce Schützen: a German-Bulgarian safety dataset designed to assess model answerability under risk, covering both a low-resource language (Bulgarian) and a high-resource language (German). Experiments with multilingual and language-specific LLMs reveal pronounced cross-language differences in safety behavior, highlighting the necessity of tailored, region-specific evaluation resources to support the responsible deployment of LLMs in Germany and Bulgaria.
Datasets and code are available at https://github.com/xnlp-lab/Schutzen. Warning: this paper contains examples that may be offensive, harmful, or biased.