Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining
Quick Answer
This study presents an auditable staged-promotion protocol for micro-pretraining on two host types, Windows A100 and Linux L40S. It is motivated by the risk that short pretraining runs over-promote configurations that only look strong at tiny budgets.
Quick Take
The staged-promotion protocol uses budgets from 2 minutes to 12 hours and freezes its promotion rules before each expensive continuation. The condition ranked first after 12 hours was not the mean-best condition at the replicated 10-minute gate, and the 5- and 10-minute rankings were host-sensitive. Because seed ranges differ across stages, these shifts are operational promotion evidence rather than within-seed curves.
Key Points
- Staged budgets ranged from 2 minutes to 12 hours across two host types.
- The 12-hour top-ranked condition did not match the 10-minute mean-best condition.
- The executed 12-hour branch spent 144 GPU-hours; the full staged protocol recorded 169.2 training GPU-hours including screening stages.
- Continuing all four 60-minute candidates would have spent 192 GPU-hours, and continuing all nine replicated 10-minute candidates would have spent 432 GPU-hours.
- The result is a bounded cost-allocation finding, not a claim of global optimality.
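The GPU-hour figures in the list above can be reproduced with simple accounting. The sketch below is not taken from the paper: it assumes one GPU per run, four host-seed cells (two hosts, two seeds), and three conditions in the executed 12-hour package (the bridge reference, the greedy comparator, and the d8/ar48 sentinel). The function name and constants are illustrative.

```python
# Sketch of the GPU-hour accounting, under stated assumptions.
HOURS_PER_CONTINUATION = 12  # 12-hour confirmation budget (from the summary)
HOST_SEED_CELLS = 4          # 2 hosts x 2 seeds (from the summary)
GPUS_PER_RUN = 1             # ASSUMPTION: each run occupies a single GPU


def continuation_gpu_hours(n_conditions: int) -> int:
    """GPU-hours to run n_conditions for 12 hours in every host-seed cell."""
    return n_conditions * HOST_SEED_CELLS * HOURS_PER_CONTINUATION * GPUS_PER_RUN


# ASSUMPTION: the executed branch holds the three conditions named in the summary.
executed = continuation_gpu_hours(3)
all_60_minute = continuation_gpu_hours(4)  # counterfactual: all four 60-minute candidates
all_10_minute = continuation_gpu_hours(9)  # counterfactual: all nine 10-minute candidates

# Screening cost implied by the reported 169.2 GPU-hour total.
screening = round(169.2 - executed, 1)

print(executed, all_60_minute, all_10_minute, screening)
```

Under these assumptions the three totals come out to 144, 192, and 432 GPU-hours, which leaves 25.2 GPU-hours for the screening stages.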
Article Content
From the source RSS summary (arXiv:2606.11387v1). Abstract: Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S.
Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate.
Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours.
The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.
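As a rough illustration of how frozen threshold rules like the 0.010 near-equivalence rule and the 0.020 mean-gap rule might be checked, here is a minimal sketch. The abstract does not define either rule precisely, so this assumes both are thresholds on the mean per-cell val_bpb gap to the bridge reference, with lower val_bpb being better. All val_bpb values are made-up placeholders, not results from the paper, and the function and variable names are hypothetical.

```python
from statistics import mean

NEAR_EQUIV_TOL = 0.010  # frozen near-equivalence threshold (from the summary)
MEAN_GAP_TOL = 0.020    # frozen mean-gap threshold (from the summary)


def mean_gap(candidate: dict, reference: dict) -> float:
    """Mean over host-seed cells of (candidate - reference) val_bpb."""
    return mean(candidate[cell] - reference[cell] for cell in reference)


def meets_rule(candidate: dict, reference: dict, tol: float) -> bool:
    """ASSUMED rule semantics: the candidate passes if its mean gap is <= tol."""
    return mean_gap(candidate, reference) <= tol


# PLACEHOLDER numbers for illustration only; cells are (host, seed) pairs.
cells = [("A100", 0), ("A100", 1), ("L40S", 0), ("L40S", 1)]
bridge = {cell: 1.000 for cell in cells}
greedy = {cell: 1.015 for cell in cells}    # mean gap of about 0.015
sentinel = {cell: 1.030 for cell in cells}  # mean gap of about 0.030

print(meets_rule(greedy, bridge, NEAR_EQUIV_TOL))  # False with these placeholders
print(meets_rule(sentinel, bridge, MEAN_GAP_TOL))  # False with these placeholders
```

Fixing the thresholds as constants before any 12-hour run mirrors the idea of freezing promotion rules ahead of the expensive continuations, so the pass/fail decision cannot be tuned after the results are seen.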