Can You Trust What You See? Human and AI Detection of Synthetic Legal Evidence

arXiv cs.CV·Jinzhe Tan, Ali Ekber Cinar, Karim Benyekhlef

6/9/2026

Quick Take

A study finds that lay participants reached only 64.8% accuracy in distinguishing authentic evidentiary photographs from AI-generated counterparts. The four MLLMs tested, including GPT-5.1, never misclassified an authentic image (100% specificity) but detected on average only 5.9% of the outputs from Gemini-3-Pro-Image, one of the harder generators. Neither humans nor MLLMs can reliably authenticate visual evidence on their own, so the authors call for combining trained human review, MLLM screening, and provenance infrastructure in legal proceedings.

Key Points

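The SLED-1400 dataset pairs 200 authentic evidence images with 1,200 synthetic counterparts from six text-to-image generators across ten evidence categories.
The 136 lay participants reached 64.8% accuracy overall, and 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), indistinguishable from chance.
The MLLMs maintained 100% specificity but missed most synthetic outputs from the harder generators.
Human and MLLM errors were largely uncorrelated, while the four MLLMs' errors were strongly correlated with each other.
The authors argue that visual evidence should be treated as inherently contestable, with trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials used together.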

Article Content

From source RSS / original summary

arXiv:2606.07613v1 Announce Type: new Abstract: Visual evidence has long been treated as a reliable form of legal proof, but advances in artificial intelligence (AI) are undermining that assumption. This article asks how well humans and frontier multimodal large language models (MLLMs) can distinguish authentic evidentiary photographs from AI-generated counterparts in the object-centric scenarios typical of civil disputes.

We built Synthetic Legal Evidence Detection (SLED-1400), a dataset of 200 authentic evidence images paired with 1,200 synthetic counterparts produced by six contemporary text-to-image generators across ten evidence categories. The same stimuli and response format were used in a controlled web experiment with 136 lay participants and in a standardized evaluation of four MLLMs (GPT-5.1, Gemini-3-Pro, Gemini-3-Flash, Qwen3-VL-235B). Human accuracy was 64.8% overall, and 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), indistinguishable from chance.

MLLMs never misclassified an authentic image (100% specificity), but missed most synthetic outputs from the harder generators, with average MLLM detection at 5.9% on Gemini-3-Pro-Image outputs. Human and MLLM errors were largely uncorrelated, while the four MLLMs were strongly correlated with each other. Neither group is a reliable standalone authenticator.
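For readers less familiar with the metrics quoted above, the following is a minimal Python sketch of how accuracy, specificity, and detection rate relate, and of one way to quantify whether two raters make the same mistakes. The function names, the toy labels, and the use of a phi (Pearson) coefficient on error indicators are illustrative assumptions; the abstract does not say which correlation measure the authors used, and none of the numbers below come from the study.

```python
from math import sqrt

def _ratio(num, den):
    # Return None instead of dividing by zero when a class is absent.
    return num / den if den else None

def detection_metrics(labels, preds):
    """Score a detector. Convention (assumed): 1 = synthetic, 0 = authentic."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # synthetic, flagged
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # authentic, passed
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # authentic, flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # synthetic, missed
    return {
        "accuracy": _ratio(tp + tn, len(pairs)),
        "specificity": _ratio(tn, tn + fp),      # share of authentic images passed
        "detection_rate": _ratio(tp, tp + fn),   # share of synthetic images caught
    }

def error_correlation(labels, preds_a, preds_b):
    """Phi coefficient between two raters' per-image error indicators."""
    err_a = [int(y != p) for y, p in zip(labels, preds_a)]
    err_b = [int(y != p) for y, p in zip(labels, preds_b)]
    n = len(labels)
    mean_a, mean_b = sum(err_a) / n, sum(err_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(err_a, err_b)) / n
    var_a, var_b = mean_a * (1 - mean_a), mean_b * (1 - mean_b)
    if var_a == 0 or var_b == 0:
        return None  # undefined when a rater is always right or always wrong
    return cov / sqrt(var_a * var_b)

# Toy data (hypothetical): four authentic images followed by four synthetic ones.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
# A cautious detector that almost always answers "authentic".
cautious = [0, 0, 0, 0, 1, 0, 0, 0]
# A rater whose mistakes fall on different images than the cautious detector's.
other = [1, 0, 1, 0, 1, 1, 1, 1]

metrics = detection_metrics(labels, cautious)
same_errors = error_correlation(labels, cautious, cautious)
different_errors = error_correlation(labels, cautious, other)
```

In this toy case the cautious detector keeps perfect specificity while catching only one of four synthetic images, which is the same pattern the abstract reports for the MLLMs at a much larger scale. A low or negative error correlation between two raters is what makes combining them potentially useful, since one may catch what the other misses.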

We argue that visual evidence in legal proceedings should be treated as inherently contestable, and that a workable procedural response must combine trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials.

