Using Probabilistic Programs to Train Inductive Reasoning in Large Language Models
Quick Answer
This paper shows that the Program-based Posterior Training (PPT) method improves inductive reasoning in Large Language Models (LLMs): the authors fine-tune on 10,000 programmatically generated scenarios, which improves estimation accuracy and alignment with human judgments.
Quick Take
PPT uses an LLM to write diverse open-world scenarios as probabilistic programs, runs probabilistic inference to obtain distributional targets, and fine-tunes on those soft labels. The gains in estimation and calibration transfer to external benchmarks, and the raw calibration gains are not subsumed by post-hoc temperature scaling, which suggests the models internalize uncertainty rather than merely rescaling their outputs.
Key Points
- PPT fine-tunes LLMs using probabilistic programs for inductive reasoning.
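- Models are fine-tuned on 10,000 programmatically generated scenarios and evaluated on held-out motifs, human-labeled judgments, and external benchmarks.
- Estimation accuracy on held-out inductive tasks improves substantially, and alignment with human judgments increases.
- Raw calibration gains are not subsumed by post-hoc temperature scaling.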
- PPT shows promise for reliable approximate inductive inference in LLMs.
Article Content
From source RSS / original summary
arXiv:2606.09856v1 Announce Type: new
Abstract: Post-training Large Language Models (LLMs) for reasoning typically focuses on deductive tasks such as mathematics and coding where correctness is verifiable. Yet, many real-world reasoning problems are inductive: agents must infer uncertain beliefs from sparse, ambiguous observations.
There are challenges to using standard fine-tuning methods for inductive reasoning, including difficulties in curating large-scale, high-quality labeled datasets and in handling targets that are inherently distributional.
In this work, we introduce a novel approach, called Program-based Posterior Training (PPT), to address these limitations: we use an LLM to generate diverse open-world scenarios as probabilistic programs, run probabilistic inference to produce distributional target responses to queries, and then fine-tune on these probabilistic soft labels. Using this approach, we fine-tune LLMs on 10,000 programmatically generated scenarios and evaluate on held-out motifs, human-labeled judgments, and external benchmarks.
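To make the idea of "probabilistic soft labels" concrete, here is a minimal Python sketch. It is not the paper's pipeline: the urn scenario, the names `PRIOR`, `P_RED`, `posterior`, and `soft_label_cross_entropy`, and the use of exact enumeration as the inference method are all assumptions chosen for illustration. It shows a toy probabilistic program, the posterior that inference produces from a few observations, and a cross-entropy loss that scores a predicted distribution against that distributional target.

```python
import math

# Hypothetical toy scenario (not from the paper): an urn is one of three
# types, and each type has a different chance of yielding a red ball.
PRIOR = {"mostly_red": 1 / 3, "balanced": 1 / 3, "mostly_blue": 1 / 3}
P_RED = {"mostly_red": 0.8, "balanced": 0.5, "mostly_blue": 0.2}


def posterior(observations):
    """Exact posterior over urn types, computed by enumeration (Bayes' rule)."""
    weights = {}
    for hypothesis, prior in PRIOR.items():
        likelihood = 1.0
        for obs in observations:
            p_red = P_RED[hypothesis]
            likelihood *= p_red if obs == "red" else 1.0 - p_red
        weights[hypothesis] = prior * likelihood
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def soft_label_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(p * math.log(predicted[h] + eps) for h, p in target.items())


# The posterior plays the role of the distributional training target.
target = posterior(["red", "red", "blue"])

# Compare two candidate "model outputs": an uninformed guess and the target itself.
uniform_guess = {h: 1 / 3 for h in PRIOR}
loss_uniform = soft_label_cross_entropy(target, uniform_guess)
loss_exact = soft_label_cross_entropy(target, target)

for name, prob in target.items():
    print(f"{name}: {prob:.3f}")
print(f"loss (uniform guess): {loss_uniform:.3f}")
print(f"loss (matches posterior): {loss_exact:.3f}")
```

The loss is lowest when the predicted distribution matches the posterior, which is the property that makes a soft label usable as a fine-tuning target. A real system would presumably need approximate inference for richer programs; enumeration works here only because the hypothesis space has three elements.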
Overall, PPT substantially improves estimation accuracy on held-out inductive tasks, increases alignment with human judgments, and transfers to external benchmarks for estimation and calibration. Additionally, the gains in raw calibration are not subsumed by post-hoc temperature scaling, showing that the models have more deeply internalized uncertainty compared to output rescaling.
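For readers unfamiliar with the baseline mentioned here, the following Python sketch illustrates what post-hoc temperature scaling does. Everything in it is an assumption for illustration: the made-up logits and labels, the grid search used to fit the temperature, and the choice of expected calibration error (ECE) as the metric, since the abstract does not say which calibration measure the authors use. Temperature scaling divides logits by a single scalar fitted on held-out data, so it changes confidence levels but never changes which answer the model ranks first.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logit_rows, labels):
    """Pick the temperature on a fixed grid (0.25 to 10.0) that minimizes NLL."""
    grid = [0.25 * k for k in range(1, 41)]
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


def expected_calibration_error(logit_rows, labels, temperature=1.0, n_bins=5):
    """ECE: weighted gap between mean confidence and accuracy per confidence bin."""
    bins = [[] for _ in range(n_bins)]
    for row, y in zip(logit_rows, labels):
        probs = softmax(row, temperature)
        conf = max(probs)
        pred = probs.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, pred == y))
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(ok for _, ok in members) / len(members)
            ece += len(members) / len(labels) * abs(avg_conf - accuracy)
    return ece


# Made-up overconfident logits: large margins, but only 3 of 5 predictions are right.
LOGITS = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
LABELS = [0, 1, 1, 0, 0]

T = fit_temperature(LOGITS, LABELS)
ece_raw = expected_calibration_error(LOGITS, LABELS)
ece_scaled = expected_calibration_error(LOGITS, LABELS, temperature=T)
print(f"fitted temperature: {T}")
print(f"ECE before scaling: {ece_raw:.3f}, after scaling: {ece_scaled:.3f}")
```

In this toy case the fitted temperature is greater than 1, which softens the overconfident predictions and lowers ECE while leaving every predicted label unchanged. The abstract's claim is that PPT's calibration gains remain after such a correction, so they cannot be explained by rescaling alone.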
Together, these results suggest that probabilistic-program-mediated fine-tuning is a promising approach for post-training LLMs to reliably perform approximate inductive inference.