A World Model of Radiologist Reading for Medical Image Representation Learning

arXiv cs.CV·Yiwei Li, Zihao Wu, Huaqin Zhao, Yifan Zhou, Chao Cao, Dajiang Zhu, Tianming Liu, Lin Zhao

4d ago

·~1 min·5/26/2026·en·0

Quick Take

GazeWorld is a novel medical imaging model that predicts latent representations based on radiologist eye-tracking data, achieving state-of-the-art diagnostic accuracy on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax benchmarks. It outperformed LogitGaze-Med by over 16% in ScanMatch and 22% in SED on the GazeSearch benchmark, demonstrating the potential of modeling expert reading behaviors for AI pretraining.

Key Points

GazeWorld predicts the next fixated patch's representation autoregressively from prior patches.
Achieved state-of-the-art accuracy across nine supervised settings on key medical benchmarks.
Outperformed LogitGaze-Med by 16% in ScanMatch and 22% in SED on GazeSearch.
Generates patch representations without requiring actual gaze data during inference.
Demonstrates the value of modeling expert reading processes for medical imaging AI.

Article Content

From source RSS / original summary

arXiv:2605. 23992v1 Announce Type: new Abstract: Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it.

GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data.

Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16\% in ScanMatch and 22\% in SED, despite not being explicitly trained to predict gaze.

GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

Reader Mode unavailable (could not extract clean content).

Read on arxiv.org

Want this in your inbox every morning?

Daily brief at your local 8am — bilingual EN/中文, free.

Subscribe — it's free

More from arXiv cs.CV

See more →

arXiv cs.CV·Taha Koleilat, Hassan Rivaz, Yiming Xiao

3d ago

FeaturedOriginal

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

AI Summary

Evi-Steer introduces a novel evidential tuning framework for BiomedCLIP, achieving 0.11% parameter updates while enhancing uncertainty-aware fine-tuning. It outperforms state-of-the-art methods across 15 biomedical imaging datasets, proving effective in few-shot learning and domain shifts for clinical applications.

#AI Coding #Inference #Open Source

A World Model of Radiologist Reading for Medical Image Representation Learning

Quick Take

Key Points

Article Content

Want this in your inbox every morning?

More from arXiv cs.CV

Evi-Steer: Learning to Steer Biomedical Vision-Language Models through Efficient and Generalizable Evidential Tuning

Deep Learning-Based Automated Quantification of TIMI Myocardial Perfusion Frame Count (DL-TMPFC) from Coronary Angiography: A Novel Framework for Rapid Assessment of Microvascular Dysfunction

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

Related in this space

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

TorqueAGI Announces Collaborations with NVIDIA, John Deere, and Dexterity to Advance Physical AI for Enterprise-Grade Robots

FORT Robotics Acquires Mapless AI to Expand Its Trust Platform with Remote Supervision and Active Safety Capabilities