Memorization Dynamics of Fill-in-the-Middle Pretraining
Quick Take
The study compares how fill-in-the-middle (FIM) and standard left-to-right (LTR) pretraining objectives affect verbatim memorization in language models.
Key Points
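- With prefix-based probes, FIM more often recovers short or partially matching spans.
- LTR more often assigns high confidence to long exact continuations.
- Verbatim recall under FIM stays anchored in prefix context; suffix context alone is not sufficient.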
Article Excerpt
From source RSS / original summary: arXiv:2605.22981v1. Abstract: Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts.
With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM-training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM-training remains strongly anchored in prefix context.
Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.
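For readers unfamiliar with the setup, here is a minimal Python sketch of the two ideas the abstract relies on: reordering a document into a FIM training example, and scoring how much of a reference span a model output reproduces. It is an illustration only, not the paper's implementation. The sentinel strings, the prefix-suffix-middle ordering, the character-level split, and the function names `to_fim_psm` and `matched_prefix_len` are all assumptions; the excerpt does not state the paper's FIM format or metrics.

```python
import random

# Hypothetical sentinel strings (assumption); real tokenizers define their own.
PRE, SUF, MID = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"


def to_fim_psm(text, rng):
    """Cut text at two random points and reorder it as prefix, suffix, middle.

    A causal model trained on the reordered string learns to generate the
    middle while conditioning on both the prefix and the suffix.
    """
    i, j = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:i], text[i:j], text[j:]
    example = f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    return example, (prefix, middle, suffix)


def matched_prefix_len(generated, reference):
    """Count the leading items on which generated and reference agree.

    Works on strings or token lists. Comparing this count against several
    span lengths separates short or partial matches from long exact ones.
    """
    n = 0
    for g, r in zip(generated, reference):
        if g != r:
            break
        n += 1
    return n
```

A probe in this style would feed the model the prefix (and, for a native FIM-format probe, the suffix as well), then call `matched_prefix_len` on the output and the held-out span. Reporting the result at more than one span length reflects the abstract's point that a single length or probing format can hide differences between objectives.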