Can You Trust What You See? Human and AI Detection of Synthetic Legal Evidence
Quick Answer
In a controlled study, lay participants distinguished authentic legal evidence images from AI-generated ones with only 64.8% accuracy. Four MLLMs, including GPT-5.1, never misclassified an authentic image (100% specificity) but detected on average just 5.9% of the outputs from the hardest generator, Gemini-3-Pro-Image.
Quick Take
Neither humans nor MLLMs can reliably authenticate visual evidence on their own. The authors argue that legal proceedings need a combined response: trained human review, MLLM screening, and provenance infrastructure.
Key Points
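- The SLED-1400 dataset pairs 200 authentic evidence images with 1,200 synthetic counterparts from six text-to-image generators across ten evidence categories.
- The 136 lay participants reached 64.8% accuracy overall, and only 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), which is indistinguishable from chance.
- The four MLLMs maintained 100% specificity but missed most synthetic outputs from the harder generators, detecting on average 5.9% of Gemini-3-Pro-Image outputs.
- Human and MLLM errors were largely uncorrelated, while the four MLLMs' errors were strongly correlated with each other.
- The authors argue that visual evidence should be treated as inherently contestable in legal proceedings, with trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials used together.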
Article Content
From source RSS / original summary
arXiv:2606.07613v1 Announce Type: new
Abstract: Visual evidence has long been treated as a reliable form of legal proof, but advances in artificial intelligence (AI) are undermining that assumption. This article asks how well humans and frontier multimodal large language models (MLLMs) can distinguish authentic evidentiary photographs from AI-generated counterparts in the object-centric scenarios typical of civil disputes.
We built Synthetic Legal Evidence Detection (SLED-1400), a dataset of 200 authentic evidence images paired with 1,200 synthetic counterparts produced by six contemporary text-to-image generators across ten evidence categories. The same stimuli and response format were used in a controlled web experiment with 136 lay participants and in a standardized evaluation of four MLLMs (GPT-5.1, Gemini-3-Pro, Gemini-3-Flash, Qwen3-VL-235B).
Human accuracy was 64.8% overall, and 48.5% and 51.0% on the two strongest generators (Gemini-3-Pro-Image and Flux-2-Max), indistinguishable from chance. MLLMs never misclassified an authentic image (100% specificity), but missed most synthetic outputs from the harder generators, with average MLLM detection at 5.9% on Gemini-3-Pro-Image outputs. Human and MLLM errors were largely uncorrelated, while the four MLLMs were strongly correlated with each other. Neither group is a reliable standalone authenticator.
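The abstract reports two different rates for the MLLMs: specificity (authentic images correctly accepted) and detection rate (synthetic images correctly flagged). A minimal sketch of how the two can diverge is shown here. The function names and all counts are hypothetical illustrations, not figures or code from the study.

```python
def specificity(true_negatives: int, false_positives: int) -> float:
    """Share of authentic images that were correctly labeled authentic."""
    return true_negatives / (true_negatives + false_positives)


def detection_rate(true_positives: int, false_negatives: int) -> float:
    """Share of synthetic images that were correctly flagged (sensitivity)."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts for illustration only; these are NOT the study's data.
authentic_kept, authentic_flagged = 200, 0    # no authentic image flagged
synthetic_flagged, synthetic_missed = 12, 188  # most synthetic images missed

spec = specificity(authentic_kept, authentic_flagged)
det = detection_rate(synthetic_flagged, synthetic_missed)

# Overall accuracy on this balanced toy set of 400 images.
total = authentic_kept + authentic_flagged + synthetic_flagged + synthetic_missed
accuracy = (authentic_kept + synthetic_flagged) / total

print(f"specificity={spec:.2f} detection_rate={det:.2f} accuracy={accuracy:.2f}")
```

In this toy case a detector that almost always answers "authentic" scores perfect specificity while catching few fakes, which is why the two rates need to be read together.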
We argue that visual evidence in legal proceedings should be treated as inherently contestable, and that a workable procedural response must combine trained human review, MLLM screening, and provenance infrastructure such as C2PA Content Credentials.
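The case for combining reviewers rests partly on the finding that human and MLLM errors were largely uncorrelated. One way to quantify error correlation between two reviewers is the phi coefficient on per-image error indicators (1 = misjudged, 0 = correct). The abstract does not say which measure the authors used, so the choice of phi, the function name, and the toy vectors below are all assumptions for illustration.

```python
from math import sqrt


def phi_coefficient(errors_a, errors_b):
    """Phi coefficient (Pearson correlation) of two equal-length 0/1 sequences."""
    if len(errors_a) != len(errors_b):
        raise ValueError("sequences must have the same length")
    pairs = list(zip(errors_a, errors_b))
    n11 = sum(1 for a, b in pairs if a and b)          # both wrong
    n10 = sum(1 for a, b in pairs if a and not b)      # only A wrong
    n01 = sum(1 for a, b in pairs if not a and b)      # only B wrong
    n00 = sum(1 for a, b in pairs if not a and not b)  # both right
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when a sequence is constant")
    return (n11 * n00 - n10 * n01) / denom


# Hypothetical per-image error indicators; NOT data from the study.
human_errors   = [1, 1, 0, 0, 1, 1, 0, 0]
model_a_errors = [1, 0, 1, 0, 1, 0, 1, 0]
model_b_errors = [1, 0, 1, 0, 1, 0, 1, 1]

human_vs_model = phi_coefficient(human_errors, model_a_errors)
model_vs_model = phi_coefficient(model_a_errors, model_b_errors)

print(f"human vs model A: {human_vs_model:.2f}")
print(f"model A vs model B: {model_vs_model:.2f}")
```

A value near zero means one reviewer's mistakes say little about the other's, so a second reviewer of a different kind can catch cases the first one misses. A value near one means the reviewers tend to fail on the same images, which limits the benefit of stacking similar models.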