Bounded-Compute Multimodal Regression for Product-Rating Prediction
Quick Take
The study presents a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction. The 228M-parameter model achieves 0.39 PLCC and 0.40 CES under the official held-out evaluation of the LoViF 2026 Efficient VLM challenge. It replaces the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states and uses fixed 384x384 image inputs with truncated metadata. In controlled ablations, this static global image processing slightly outperforms dynamic tiling. The authors position the result as a strong, reproducible baseline for multimodal regression under strict latency budgets.
Key Points
- Achieved 0.39 PLCC and 0.40 CES on the LoViF 2026 Efficient VLM challenge.
- Replaced language-modeling head with a lightweight two-layer MLP for efficiency.
- Static global image processing slightly outperforms dynamic tiling in controlled ablations.
- Scaling training data from 100K to 16M examples substantially improves validation correlation.
- Model offers a reproducible baseline for resource-constrained multimodal regression.
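To make the head replacement in the key points concrete, here is a minimal, dependency-free sketch of a two-layer MLP regression head fed by pooled decoder states. It is not the authors' implementation. The masked mean pooling, the ReLU activation, the layer widths, the random initialization, and the names `masked_mean_pool` and `RegressionHead` are all assumptions for illustration, since the summary does not specify them.

```python
import math
import random


def masked_mean_pool(states, mask):
    """Average the hidden vectors at positions where mask is 1 (assumed pooling)."""
    dim = len(states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(states, mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    if count == 0:
        raise ValueError("mask selects no positions")
    return [t / count for t in total]


def linear(x, weight, bias):
    """Dense layer; weight holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class RegressionHead:
    """Hypothetical two-layer MLP mapping pooled decoder states to one scalar."""

    def __init__(self, hidden_dim, mlp_dim, seed=0):
        rng = random.Random(seed)
        s1 = 1.0 / math.sqrt(hidden_dim)
        s2 = 1.0 / math.sqrt(mlp_dim)
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(hidden_dim)]
                   for _ in range(mlp_dim)]
        self.b1 = [0.0] * mlp_dim
        self.w2 = [[rng.uniform(-s2, s2) for _ in range(mlp_dim)]]
        self.b2 = [0.0]

    def __call__(self, states, mask):
        pooled = masked_mean_pool(states, mask)
        # ReLU is an assumed activation; the source does not name one.
        hidden = [max(0.0, h) for h in linear(pooled, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)[0]


# Toy decoder states: three positions, the last one is padding.
states = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.4, -0.1, 0.2], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
head = RegressionHead(hidden_dim=4, mlp_dim=8)
score = head(states, mask)
```

A single forward pass through such a head yields the rating directly, with no autoregressive decoding, which is the property that keeps compute bounded.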
Article Excerpt
From source RSS / original summary
arXiv:2605.27737v1. Abstract: Vision-language models (VLMs) are increasingly attractive for multimodal quality assessment, but their default reliance on autoregressive text generation and dynamic visual processing is poorly matched to scalar regression under strict latency budgets. We present a bounded-compute adaptation of SmolVLM2-256M-Video-Instruct for product-rating prediction in the LoViF 2026 Efficient VLM challenge.
Motivated by recent multimodal engagement-prediction results showing that feature-based regression can outperform token-based score generation, we replace the language-modeling head with a lightweight two-layer MLP fed by pooled decoder states, and we enforce deterministic inputs through fixed 384x384 images and truncated metadata.
Across controlled ablations, static global image processing slightly outperforms dynamic tiling, and scaling from 100K to 16M training examples substantially improves validation correlation. Under the official held-out evaluation, our 228M-parameter model achieves 0.39 PLCC and 0.40 CES, providing a strong and reproducible baseline for resource-constrained multimodal regression.
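For readers unfamiliar with the first metric: PLCC is the Pearson linear correlation coefficient between predicted and ground-truth scores. The sketch below is a generic reference computation, not the challenge's official scoring script, which the excerpt does not describe. The function name `plcc` and the sample values are illustrative assumptions.

```python
import math


def plcc(predicted, target):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(target) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_t)


# Illustrative values only: predictions that swap the middle two ratings.
example = plcc([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0])
```

Because PLCC is invariant to the scale and offset of the predictions, it rewards getting the linear trend of the ratings right rather than matching their absolute values.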