Readable Yet Unpredictable: Rotated-Outcome Prediction in Vision-Language Models
Quick Answer
This paper shows that vision-language models (VLMs) struggle with Rotated-Outcome Prediction: given only the original image, they often fail to infer what a 180° in-plane rotation would show.
Quick Take
Many vision-language models can recognize the relevant content when shown either the original or the rotated image directly, yet fail to predict the rotated result from the original alone. On the controlled text-image rotations in RotOutBench, predicted-rotation accuracy collapses to near zero even for models with high direct-reading accuracy. The gap matters for applications that depend on predictive visual reasoning.
Key Points
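- RotOutBench, a paired diagnostic benchmark, exposes VLMs' limits in predicting rotated outcomes.
- Many models recognize the content when shown the original or the rotated image directly, but fail to predict the rotated result from the original alone.
- On controlled text-image rotations, predicted-rotation accuracy drops to near zero, even for models with high direct-reading accuracy.
- A model-level case study shows the prediction state can approach the rotated-image reading state, while the final readout still shifts toward the original string.
- Current VLMs can recognize a transformed visual state when it is shown, but often fail to predict it from the original view.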
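The three conditions in the points above (reading the original directly, reading the rotated image directly, and predicting the rotated reading from the original) can be compared with a simple exact-match score. The following is a minimal sketch, not the benchmark's own evaluation code: the `PairedItem` fields, the `original_leak` measure, and the toy records are all hypothetical and invented purely to show the bookkeeping.

```python
from dataclasses import dataclass


@dataclass
class PairedItem:
    """One hypothetical paired record (field names are assumptions)."""
    original_answer: str    # ground-truth reading of the original image
    rotated_answer: str     # ground-truth reading after a 180° rotation
    direct_original: str    # model output when shown the original image
    direct_rotated: str     # model output when shown the rotated image
    predicted_rotated: str  # model output when shown only the original
                            # and asked what the rotated image would read


def accuracy(pairs):
    """Exact-match accuracy over (prediction, gold) pairs; 0.0 if empty."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(pred == gold for pred, gold in pairs) / len(pairs)


def summarize(items):
    """Score each condition, plus how often the prediction simply
    repeats the original reading instead of the rotated one."""
    return {
        "direct_original": accuracy(
            (i.direct_original, i.original_answer) for i in items),
        "direct_rotated": accuracy(
            (i.direct_rotated, i.rotated_answer) for i in items),
        "predicted_rotated": accuracy(
            (i.predicted_rotated, i.rotated_answer) for i in items),
        "original_leak": accuracy(
            (i.predicted_rotated, i.original_answer) for i in items),
    }


# Invented toy records, NOT results from the paper.
toy_items = [
    PairedItem("6098", "8609", "6098", "8609", "6098"),
    PairedItem("6680", "0899", "6680", "0899", "6680"),
    PairedItem("8969", "6968", "8969", "6968", "6968"),
]

summary = summarize(toy_items)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Reporting the direct and predicted conditions side by side is what makes the gap visible: a model can score perfectly on both direct readings and still score poorly when it has to predict the rotated reading.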
Article Excerpt
From source RSS / original summary (arXiv:2606.07641v1). Abstract: Can vision-language models predict what a 180° rotation would reveal from the original image alone? We study this ability through Rotated-Outcome Prediction: given an original image, a model must answer what would be seen or read after a 180° in-plane rotation, without directly observing the rotated target. To isolate this gap, we introduce RotOutBench, a paired diagnostic benchmark spanning open visual cases and controlled text-image rotations.
A sharp pattern emerges: many VLMs can recognize the relevant content when directly given either the original or rotated image, yet fail to infer the rotated result from the original image alone. On controlled text-image rotations, predicted-rotation accuracy collapses to near zero even for models with high direct-reading accuracy. A model-level case study further shows that the prediction state can approach a rotated-image reading state, while the final readout still shifts toward the original string.
Current VLMs can recognize a transformed visual state when it is shown, but often fail to predict that state from the original view.
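To make the transformation in the task definition concrete, the sketch below applies a 180° in-plane rotation to a tiny pixel grid and to a digit string. It is an illustration only: the digit-to-digit table `ROTATED_GLYPH` is an assumption about how a few digits read upside down in a simple font, and nothing here is taken from RotOutBench's actual stimuli.

```python
def rotate_180(grid):
    """Rotate a 2D grid (list of rows) by 180 degrees in-plane.

    Pixel (r, c) moves to (H-1-r, W-1-c): reverse the row order,
    then reverse each row.
    """
    return [list(reversed(row)) for row in reversed(grid)]


# Assumed glyph table: how each digit reads after a 180° rotation
# in a simple font. Digits not listed are treated as unreadable.
ROTATED_GLYPH = {"0": "0", "6": "9", "8": "8", "9": "6"}


def read_after_rotation(text):
    """Return how a digit string would read after a 180° rotation:
    the character order reverses and each glyph is itself rotated."""
    return "".join(ROTATED_GLYPH[ch] for ch in reversed(text))


# Toy 3x4 "image" of an L-shaped glyph (1 = ink, 0 = background).
original = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
]

rotated = rotate_180(original)
for row in rotated:
    print(row)

print(read_after_rotation("6098"))
```

The string case shows why the task is non-trivial: the answer depends on both reversing the order and re-reading every glyph, so repeating the original string is wrong whenever the two readings differ.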