Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
Quick Take
The study reveals that wrapping untrusted inputs in mock tool calls does not enhance robustness in large language models, with increased attack success rates observed in binary evaluation tasks like GSM8K grading. This contradicts the expected trustworthiness hierarchy, suggesting a need for better training methods or new primitives for handling untrusted inputs.
Key Points
- Mock tool wrapping does not improve model robustness against untrusted inputs.
- In GSM8K grading, attack success rates increased instead of decreasing.
- Results vary by model, with no consistent improvements across tested systems.
- The findings challenge the existing instruction hierarchy of trustworthiness.
- Future work should focus on stronger training methods for untrusted input handling.
Article Content
From source RSS / original summaryarXiv:2605. 30521v1 Announce Type: new Abstract: Large language models must frequently process untrusted inputs, such as judging an answer from another model or running tasks like spam and harm classifiers while under adversarial pressure. These inputs are often string-formatted directly into a prompt template, leaving systems fragile to manipulation.
Current LLM specs from major providers like OpenAI distinguish trustworthiness along an Instruction Hierarchy, from System messages (most trusted) to Tool Results (least trusted). A possible natural mitigation is to wrap untrusted content in a mock tool call as a quarantine. We explore this hypothesis with an automated redteaming search over static attack strings across seven models and three LLM-as-a-Judge tasks. Counter to our hypothesis, tool-wrapping does not broadly improve robustness.
On a binary evaluation task (GSM8K grading) it typically increases attack success rates, an apparent inversion of the instruction hierarchy. On scalar and pairwise tasks the effect is smaller and model-dependent, with no tested model reliably helped, and several showing inversion. We recommend evaluating this limitation in deployed systems, and longer-term, pursuing stronger Instruction Hierarchy training or new untrusted-input primitives.
Reader Mode unavailable (could not extract clean content).
Want this in your inbox every morning?
Daily brief at your local 8am — bilingual EN/中文, free.
More from arXiv cs.CL
See more →Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?
The REFLECT benchmark reveals that current LLM judges are unreliable, achieving below 55% accuracy in evaluating reasoning and evidence use, highlighting the need for improved evaluation methods for deep research agents.