Schützen: Evaluating LLM Safety in Bulgarian and German Contexts

arXiv cs.CL·Kiril Georgiev, Yuxia Wang, Dimitar Iliyanov Dimitrov, Preslav Nakov, Ivan Koychev

2d ago

·~1 min·6/11/2026·en·0

Quick Answer

This paper introduces Schützen, a German-Bulgarian safety evaluation dataset for LLMs that addresses the lack of safety resources beyond English and Chinese.

Quick Take

Schützen is a German-Bulgarian safety dataset that assesses whether LLMs answer or decline risky prompts, covering one high-resource language (German) and one low-resource language (Bulgarian). Experiments with multilingual and language-specific models show pronounced cross-language differences in safety behavior, which underscores the need for region-specific evaluation resources for responsible LLM deployment in diverse sociocultural contexts.

Key Points

Schützen targets safety evaluation for LLMs in German and Bulgarian contexts.
Existing safety datasets are overwhelmingly English- and Chinese-centric.
Multilingual LLMs exhibit notable differences in safety behavior across languages.
The dataset aims to support responsible LLM deployment in Germany and Bulgaria.
Code and datasets are publicly available on GitHub.


Article Excerpt

From source RSS / original summary

arXiv:2606.11316v1. Abstract: Large language models are increasingly deployed across professional domains, bringing hard-to-predict risks, including the generation of harmful or disrespectful content. Although substantial progress has been made in developing safety evaluation datasets, existing resources remain overwhelmingly English- and Chinese-centric. This limitation is particularly pronounced when evaluating languages that operate within shared sociocultural, legal, and ethical contexts.

To address this gap, we introduce Schützen: a German-Bulgarian safety dataset designed to assess model answerability under risk, covering both a low-resource language (Bulgarian) and a high-resource language (German). Experiments with multilingual and language-specific LLMs reveal pronounced cross-language differences in safety behavior, highlighting the necessity of tailored, region-specific evaluation resources to support the responsible deployment of LLMs in Germany and Bulgaria.

Datasets and code are available at https://github.com/xnlp-lab/Schutzen. Warning: this paper contains examples that may be offensive, harmful, or biased.
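As a rough illustration of what a "cross-language difference in safety behavior" could look like in practice, the sketch below compares refusal rates between German and Bulgarian on a tiny set of made-up records. Everything here is an assumption for illustration: the record layout, the idea that prompts are paired across the two languages, the binary "refused" label, and the function names `refusal_rates` and `disagreements` are hypothetical and are not taken from the Schützen repository or the paper's evaluation protocol.

```python
from collections import defaultdict

# Hypothetical records: (language, prompt_id, model_refused).
# These values are invented for illustration; they are NOT Schützen data.
records = [
    ("de", "q1", True),
    ("de", "q2", True),
    ("de", "q3", False),
    ("de", "q4", True),
    ("bg", "q1", True),
    ("bg", "q2", False),
    ("bg", "q3", False),
    ("bg", "q4", True),
]


def refusal_rates(rows):
    """Return the fraction of refused prompts per language."""
    counts = defaultdict(lambda: [0, 0])  # language -> [refusals, total]
    for lang, _prompt_id, refused in rows:
        counts[lang][0] += int(refused)
        counts[lang][1] += 1
    return {lang: refusals / total for lang, (refusals, total) in counts.items()}


def disagreements(rows, lang_a, lang_b):
    """Return prompt ids where the model's decision differs between two languages.

    Assumes prompts share an id across languages, which is an assumption
    of this sketch rather than a documented property of the dataset.
    """
    by_lang = defaultdict(dict)
    for lang, prompt_id, refused in rows:
        by_lang[lang][prompt_id] = refused
    shared = by_lang[lang_a].keys() & by_lang[lang_b].keys()
    return sorted(p for p in shared if by_lang[lang_a][p] != by_lang[lang_b][p])


rates = refusal_rates(records)
gap = rates["de"] - rates["bg"]
print(rates, gap, disagreements(records, "de", "bg"))
```

A per-prompt disagreement list is often more informative than the aggregate gap alone, since two languages can have similar refusal rates while the model refuses different prompts in each.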

