Hallucination Behavior in Multimodal LLMs Across Agricultural Image Interpretation and Generation Tasks
Quick Take
This study finds that multimodal Large Language Models (LLMs) hallucinate in agricultural imaging tasks. In image interpretation, models such as Gemma reach zero-shot accuracy of 63% to 75%, and few-shot prompting raises this to as much as 86.8%. In text-to-image tasks, models such as GPT-5 generate up to 91% biologically inconsistent scenes under relaxed prompt constraints, highlighting critical weaknesses in current LLMs for agricultural applications.
Key Points
- LLMs hallucinate in agricultural imaging tasks, which can lead to misinformed agronomic insights.
- Zero-shot accuracy for image interpretation ranges from 63% to 75%.
- Few-shot prompting improves accuracy to as much as 86.8%, but false detections and missed infections persist.
- Text-to-image models generate up to 91% biologically inconsistent scenes under relaxed prompt constraints.
- The study offers insights for improving LLM reliability in agriculture.
Article Content
From source RSS / original summary
arXiv:2605.27595v1, Announce Type: new
Abstract: Large Language Models (LLMs) are being rapidly adopted in agricultural imaging applications, ranging from crop interpretation to synthetic field image generation. However, these models frequently exhibit hallucinations (outputs that appear confident yet deviate from biological or environmental reality), potentially leading to misinformed agronomic insights.
This study investigates such hallucinations in two complementary directions: image-to-text, where LLMs interpret crop or field imagery to describe conditions such as biotic and abiotic stresses, and text-to-image, where models generate synthetic agricultural scenes based on descriptive prompts. We examine errors involving biological inconsistency, contextual inaccuracy, and agronomic implausibility, evaluating the outputs under domain-informed criteria across multiple imaging modalities.
Our analysis identifies recurring hallucination patterns within both interpretive and generative tasks. In image interpretation, LLMs (e.g., Gemma, LLAVA, Qwen, and MiniCPM) achieved modest zero-shot accuracy (63 to 75 percent). Few-shot prompting improved performance to as much as 86.8 percent, yet the models still produced false detections and missed infections, indicating residual hallucination effects.
In text-to-image tasks, advanced models such as GPT-5 and Gemini 2.5 Flash generate up to 91 percent biologically inconsistent scenes under relaxed prompt constraints, revealing fundamental weaknesses in current LLMs. This systematic assessment of visual reasoning and generation offers critical insights toward enhancing the reliability and trustworthiness of LLM-based agricultural imaging platforms.
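The abstract reports accuracy alongside two error types, false detections and missed infections, but does not say how they were computed. As a rough illustration only, the sketch below scores a binary "infected / healthy" interpretation task. It assumes that a false detection is a healthy image labeled as infected and that a missed infection is an infected image labeled as healthy. The function name, the `Scores` fields, and the toy labels are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Scores:
    accuracy: float
    false_detection_rate: float   # healthy images flagged as infected
    missed_infection_rate: float  # infected images reported as healthy


def score_interpretations(truth: Sequence[bool], predicted: Sequence[bool]) -> Scores:
    """Score binary predictions, where True means 'infected'.

    Hypothetical helper: the metric definitions are assumptions,
    not the paper's stated evaluation protocol.
    """
    if len(truth) != len(predicted) or len(truth) == 0:
        raise ValueError("truth and predicted must be non-empty and equal length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)

    healthy = fp + tn
    infected = tp + fn
    return Scores(
        accuracy=(tp + tn) / len(truth),
        false_detection_rate=fp / healthy if healthy else 0.0,
        missed_infection_rate=fn / infected if infected else 0.0,
    )


# Toy labels, invented purely for illustration.
truth = [True, True, True, False, False, False, False, True]
predicted = [True, False, True, True, False, False, False, True]

scores = score_interpretations(truth, predicted)
print(scores)
```

Reporting the two error rates separately from accuracy matters here because a model can gain accuracy under few-shot prompting while still missing infections, which is the residual effect the abstract describes.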
More from arXiv cs.CV
Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning
Evi-Steer is an evidential tuning framework for BiomedCLIP that updates only 0.11% of parameters while enabling uncertainty-aware fine-tuning. It outperforms state-of-the-art methods across 15 biomedical imaging datasets and remains effective under few-shot learning and domain shift in clinical applications.