ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law
Quick Take
The ImmigrationQA dataset comprises 17,058 question-answer pairs across 13 immigration subdomains. A Llama 3.2 3B Instruct model fine-tuned on it with LoRA achieves a 27% relative improvement in mean score over the base model. The full pipeline cost approximately $29 in cloud compute. The system is aimed at petitioners who lack legal representation, but it is not a substitute for legal counsel.
Key Points
- Dataset constructed from 11 sources, including USCIS Policy Manual and BIA decisions.
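- Fine-tuned model scored a mean of 1.08/3.0 under LLM-as-judge scoring on a 101-example stratified sample, outperforming the Llama 3 8B base model at 0.85/3.0.
- Model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) but remains weak on complex legal reasoning and time-sensitive statistics.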
- All artifacts including dataset and model are publicly available.
- System does not reflect regulatory changes made after the corpus crawl date.
Article Content
From the source RSS summary (arXiv:2605.30589v1). Abstract: U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA.
The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community Q&A -- yielding 10,056 validated canonical documents and 18,308 text chunks. Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap.
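The abstract does not say how source-span overlap was measured. As a rough illustration only, the sketch below assumes a simple token-overlap ratio between an answer and its source chunk. The function names, the `answer` and `chunk` keys, and the 0.5 threshold are all hypothetical and are not taken from the paper.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def span_overlap(answer: str, source_chunk: str) -> float:
    """Fraction of distinct answer tokens that also appear in the source chunk."""
    answer_tokens = tokenize(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokenize(source_chunk)) / len(answer_tokens)

def filter_pairs(pairs, min_overlap=0.5):
    """Split QA pairs into (kept, rejected) using a hypothetical threshold."""
    kept, rejected = [], []
    for pair in pairs:
        score = span_overlap(pair["answer"], pair["chunk"])
        (kept if score >= min_overlap else rejected).append(pair)
    return kept, rejected

# Toy data, invented for illustration.
chunk = "Applicants file Form I-485 to adjust status."
pairs = [
    {"answer": "File Form I-485", "chunk": chunk},
    {"answer": "Pay nine thousand dollars", "chunk": chunk},
]
kept, rejected = filter_pairs(pairs)
```

A check of this kind rejects answers whose wording cannot be traced back to the chunk they were generated from, which is one plausible reading of "source-grounded"; the authors' actual criterion may differ.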
The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct).
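The 27% figure follows from the two reported means: (1.08 - 0.85) / 0.85 ≈ 0.27. The short sketch below reproduces that arithmetic and shows how a mean score and a "fully correct" rate could be derived from per-example judge scores. It assumes integer scores from 0 to 3 with 3 meaning fully correct, which the abstract implies but does not state; the helper names and the toy score list are hypothetical.

```python
from statistics import mean

def summarize(scores):
    """Return (mean score, fraction of examples that received the top score of 3)."""
    return mean(scores), sum(s == 3 for s in scores) / len(scores)

def relative_improvement(new: float, baseline: float) -> float:
    """Improvement of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline

# Reported means from the abstract: fine-tuned 1.08, base 0.85.
gain_percent = round(relative_improvement(1.08, 0.85) * 100)
print(gain_percent)  # 27

# Toy judge scores, invented for illustration.
toy_mean, toy_fully_correct = summarize([3, 0, 1, 0])
```

Note that the improvement is relative to the base model's mean, not an absolute gain on the 3-point scale; the absolute difference is 0.23 points.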
The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.