SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

arXiv cs.AI·Han Li, Vibhor Malik, Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ailin Fan, Keat Yang Koay, Yuanzheng Zhu, Meysam Feghhi, Ronie Uliana, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Zhong Wu, Lingyun Wang

5/20/2026


Quick Answer

SimGym is a framework for simulating e-commerce A/B tests with vision-language model (VLM) agents operating in a live browser. In validation it reached 77% directional alignment with the add-to-cart shifts observed in real-buyer traffic.

Quick Take

Simulating an A/B test with SimGym's VLM agents takes under an hour instead of weeks and exposes no real buyers to candidate variants. Validated on A/B tests of visually driven UI theme changes, the agents attained 77% directional alignment with add-to-cart shifts observed in real-buyer traffic.

Key Points

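A traffic-grounded persona pipeline derives per-shop buyer archetypes and intents from production clickstream data.
Live-browser agents combine multimodal perception (visual and browser-structured observations) with episodic memory and guardrails to run coherent shopping sessions.
An evaluation protocol compares simulated outcome shifts with observed shifts in real buyer behavior; agents reach 77% directional alignment on add-to-cart shifts.
Experimental cycles are reduced from weeks to under an hour.
Real buyers are never exposed to candidate variants during simulation.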


[Submitted on 19 May 2026]


Abstract: A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We validate SimGym on A/B tests of visually driven UI theme changes from a major e-commerce platform across diverse storefronts and product categories. Empirical results show that SimGym agents achieve strong agreement with observed outcome shifts, attaining 77% directional alignment with add-to-cart shifts observed across interface variants in real-buyer traffic. SimGym reduces experimental cycles from weeks to under an hour, enabling rapid experimentation without exposing real buyers to candidate variants.
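The abstract does not spell out how directional alignment is computed. One plausible reading, sketched below under that assumption, is the fraction of A/B tests in which the simulated add-to-cart shift (treatment minus control) has the same sign as the shift observed in real-buyer traffic. The `ABTestResult` type, its field names, and the example rates are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ABTestResult:
    """Add-to-cart rates for one A/B test (hypothetical field names)."""
    test_id: str
    sim_control: float
    sim_treatment: float
    real_control: float
    real_treatment: float


def shift_sign(control: float, treatment: float) -> int:
    """Return +1 if treatment beats control, -1 if it is worse, 0 if tied."""
    delta = treatment - control
    return (delta > 0) - (delta < 0)


def directional_alignment(results: list[ABTestResult]) -> float:
    """Fraction of tests where simulated and observed shifts share a sign.

    Assumed definition; the paper's exact protocol may differ.
    """
    if not results:
        raise ValueError("need at least one test")
    matches = sum(
        shift_sign(r.sim_control, r.sim_treatment)
        == shift_sign(r.real_control, r.real_treatment)
        for r in results
    )
    return matches / len(results)


# Made-up illustrative rates, not data from the paper.
example = [
    ABTestResult("t1", 0.10, 0.12, 0.08, 0.09),  # both up: match
    ABTestResult("t2", 0.10, 0.09, 0.08, 0.07),  # both down: match
    ABTestResult("t3", 0.10, 0.11, 0.08, 0.07),  # sim up, real down: mismatch
    ABTestResult("t4", 0.10, 0.13, 0.08, 0.10),  # both up: match
]
print(directional_alignment(example))  # 0.75
```

A sign-only comparison ignores the magnitude of each shift, so it says whether the simulation picks the same winner as real traffic, not how closely the effect sizes agree.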

Subjects:	Artificial Intelligence (cs.AI)
Cite as:	arXiv:2605.19219 [cs.AI]
	(or arXiv:2605.19219v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.19219 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zahra Zanjani Foumani
[v1] Tue, 19 May 2026 00:46:41 UTC (6,393 KB)

— Originally published at arxiv.org

