Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

arXiv cs.AI·Zhiyuan Jerry Lin, Benjamin Letham, Samuel Dooley, Maximilian Balandat, Eytan Bakshy

5/20/2026

Quick Take

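The paper introduces ReElicit, a Bayesian optimization framework for tuning system prompts when the only feedback is one aggregate score per prompt. An LLM elicits a compact, interpretable feature space from the evaluated prompts and their scores, maps prompts into it, and re-elicits the space as new evaluations arrive. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among the aggregate-only baselines it is compared against, which suggests that LLMs can serve as adaptive semantic representation builders and not only as prompt generators.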

Key Points

ReElicit applies Bayesian optimization to system prompt tuning when only aggregate scores are available.
An acquisition function over a Gaussian process surrogate selects target feature vectors, which the LLM realizes as deployable system prompts.
The feature space is re-elicited as new evaluations arrive, so the representation adapts to the observed prompt-score history.
ReElicit achieved the strongest aggregate performance profile across ten prompt optimization tasks with a total budget of 30 evaluations.
The results suggest that LLMs can act as adaptive semantic representation builders, not only as prompt generators.
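
To make the surrogate step in the key points more concrete, here is a minimal sketch of a Gaussian process posterior and an expected-improvement score over feature vectors. It is an illustration under stated assumptions, not the paper's implementation: the RBF kernel, the unit lengthscale, the noise level, the choice of expected improvement, and every function name are assumptions, and the feature vectors and scores are made up.

```python
import math

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between two feature vectors (assumed kernel)."""
    sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * sq / lengthscale ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L x = b by forward substitution."""
    x = []
    for i in range(len(b)):
        x.append((b[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i])
    return x

def solve_upper_t(L, b):
    """Solve L^T x = b by back substitution."""
    n = len(b)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (b[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(X, y, x_new, noise=1e-6):
    """Posterior mean and standard deviation at x_new, with a constant prior mean."""
    n = len(X)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]  # centre the scores around their average
    K = [[rbf(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, resid))
    k_star = [rbf(x, x_new) for x in X]
    mean = y_mean + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 0.0)
    return mean, math.sqrt(var)

def expected_improvement(mean, std, best):
    """Expected improvement over the best observed score (maximization)."""
    if std == 0.0:
        return max(mean - best, 0.0)
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return (mean - best) * cdf + std * pdf

# Made-up feature vectors for three evaluated prompts and their aggregate scores.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
y = [0.50, 0.62, 0.55]
candidates = [[1.0, 1.0], [0.5, 0.5], [2.0, 0.0]]
scores = [expected_improvement(*gp_posterior(X, y, c), best=max(y)) for c in candidates]
target = candidates[scores.index(max(scores))]
```

The selected `target` is a point in feature space rather than a prompt; in the framework described here, turning it into text is the LLM's job.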

[Submitted on 18 May 2026]

Abstract: System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on embedding by elicitation. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. Leveraging a probabilistic Gaussian process surrogate, an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a total budget of 30 evaluations, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.
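
The loop described in the abstract can be sketched roughly as follows. This only illustrates the control flow (re-elicit, re-embed, select a target, realize, evaluate) and is not the authors' code. Every helper is a hypothetical stand-in: `elicit_features`, `embed`, and `realize` replace LLM calls with toy keyword rules, `evaluate` is a made-up scoring function, and `select_target` is a deliberately crude nearest-neighbour heuristic standing in for the Gaussian process surrogate and acquisition function. Counting seed prompts against the budget is also an assumption.

```python
import random

# Hypothetical pool of features a real LLM would instead propose itself.
FEATURE_POOL = ["step-by-step", "concise", "cite", "verify", "format"]

def elicit_features(task, history, k=3):
    """Stand-in for the LLM eliciting a compact feature space from the history."""
    avg = sum(s for _, s in history) / len(history)
    def weight(f):
        # Toy rule: favour keywords seen in above-average prompts.
        return sum(s - avg for p, s in history if f in p)
    return sorted(FEATURE_POOL, key=weight, reverse=True)[:k]

def embed(prompt, features):
    """Stand-in for the LLM mapping a prompt into the elicited feature space."""
    return [1.0 if f in prompt else 0.0 for f in features]

def realize(target, features):
    """Stand-in for the LLM turning a target feature vector into a prompt."""
    chosen = [f for f, t in zip(features, target) if t >= 0.5]
    return "You are a helpful assistant. Style: " + ", ".join(chosen or ["plain"])

def evaluate(prompt):
    """Toy aggregate metric: one scalar per prompt, no per-example feedback."""
    return (0.5 + 0.2 * ("verify" in prompt) + 0.1 * ("step-by-step" in prompt)
            - 0.1 * ("concise" in prompt))

def select_target(Z, y, dim, rng, n_candidates=32):
    """Crude stand-in for surrogate + acquisition (NOT a Gaussian process):
    score of the nearest observed point plus a bonus for being far from it."""
    best, best_val = None, float("-inf")
    for _ in range(n_candidates):
        c = [float(rng.random() < 0.5) for _ in range(dim)]
        dist, score = min(
            (sum((a - b) ** 2 for a, b in zip(c, z)), s) for z, s in zip(Z, y)
        )
        val = score + 0.1 * dist
        if val > best_val:
            best, best_val = c, val
    return best

def optimize(task, seed_prompts, budget=30, seed=0):
    """Run the loop until `budget` prompts have been scored in total."""
    rng = random.Random(seed)
    history = [(p, evaluate(p)) for p in seed_prompts]
    while len(history) < budget:
        features = elicit_features(task, history)      # re-elicit every round
        Z = [embed(p, features) for p, _ in history]   # re-embed the full history
        y = [s for _, s in history]
        target = select_target(Z, y, len(features), rng)
        prompt = realize(target, features)
        history.append((prompt, evaluate(prompt)))
    return max(history, key=lambda ps: ps[1]), history

best, history = optimize("toy task", ["You are a helpful assistant."])
```

Because the feature list is recomputed on every iteration, the whole history is re-embedded before each selection step, which is what lets the representation change as scores accumulate.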

Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2605.19093 [cs.AI]
	(or arXiv:2605.19093v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.19093 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Zhiyuan Jerry Lin
[v1] Mon, 18 May 2026 20:28:17 UTC (142 KB)

— Originally published at arxiv.org
