ImmigrationQA: A Source-Grounded Dataset and Small-Model Adaptation for U.S. Immigration Law

arXiv cs.CL·Nazarii Shportun

6/1/2026

Quick Answer

ImmigrationQA is a dataset of 17,058 question-answer pairs across 13 immigration subdomains. A Llama 3.2 3B Instruct model fine-tuned on it with LoRA achieves a 27% relative improvement in mean score over the Llama 3 8B base model.

Quick Take

The ImmigrationQA dataset comprises 17,058 question-answer pairs across 13 immigration subdomains. A Llama 3.2 3B Instruct model fine-tuned on it with LoRA scores 27% higher in mean score than the Llama 3 8B base model. The full pipeline cost approximately $29 in cloud compute. The system is aimed at petitioners who lack legal representation, but it is not a substitute for legal counsel.

Key Points

Dataset constructed from 11 sources, including USCIS Policy Manual and BIA decisions.
Fine-tuned model scored 1.08/3.0, outperforming the Llama 3 8B base model at 0.85/3.0.
Improvement is concentrated in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas); the model remains weak on complex legal reasoning and time-sensitive statistics.
All artifacts including dataset and model are publicly available.
System does not reflect regulatory changes post-corpus crawl date.

Article Content

From source RSS / original summary

arXiv:2605.30589v1 Announce Type: new Abstract: U.S. immigration law spans thousands of pages of official policy, federal regulations, and procedural guidance that change frequently and carry high stakes for petitioners who lack legal representation. We describe the construction of ImmigrationQA, a source-grounded question-answering dataset of 17,058 pairs across 13 immigration subdomains, and the fine-tuning of a Llama 3.2 3B Instruct model on that dataset using parameter-efficient LoRA.
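To see why LoRA counts as parameter-efficient, it can help to count the trainable parameters it adds to a single weight matrix. The sketch below uses the standard low-rank formulation (a frozen weight plus a learned update B·A). The rank and matrix dimensions are illustrative assumptions only; the summary does not report the configuration the author used.

```python
def lora_added_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA freezes the original weight W and learns a low-rank update B @ A,
    where B has shape (d_out, rank) and A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


# Assumed, illustrative values: not taken from the paper or its summary.
d_out, d_in, rank = 3072, 3072, 16

full = d_out * d_in                      # parameters in the frozen matrix
added = lora_added_params(d_out, d_in, rank)

print(f"full matrix: {full:,} parameters")
print(f"LoRA update: {added:,} parameters ({added / full:.2%} of the full matrix)")
```

Under these assumed shapes the trainable update is a small fraction of the matrix it adapts, which is consistent with the low compute cost reported for the pipeline.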

The corpus was assembled from 11 primary and secondary sources -- including the USCIS Policy Manual, 8 CFR, BIA precedent decisions, and community Q&A -- yielding 10,056 validated canonical documents and 18,308 text chunks. Structured QA pairs were generated from these chunks using Claude Sonnet 4.6 via five mode-specific prompts, with 22 pairs rejected for insufficient source-span overlap.
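The summary does not say how source-span overlap was measured. As a rough sketch of what such a filter could look like, the code below scores a generated answer by the fraction of its word tokens that also appear in the source chunk and rejects pairs under a threshold. The token-overlap metric, the function names, the 0.5 threshold, and the example strings are all assumptions made for illustration, not the author's method.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, a crude basis for measuring overlap."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def span_overlap(answer: str, source_chunk: str) -> float:
    """Fraction of the answer's distinct tokens that occur in the source chunk."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(source_chunk)) / len(answer_tokens)


def keep_pair(answer: str, source_chunk: str, min_overlap: float = 0.5) -> bool:
    """Hypothetical acceptance rule: keep pairs at or above the threshold."""
    return span_overlap(answer, source_chunk) >= min_overlap


# Toy strings written for this example; they are not drawn from the dataset.
chunk = "An applicant files Form I-485 to apply for adjustment of status."
grounded = "Form I-485 is filed to apply for adjustment of status."
ungrounded = "Processing usually takes eleven weeks."

print(keep_pair(grounded, chunk))    # True: most answer tokens appear in the chunk
print(keep_pair(ungrounded, chunk))  # False: no answer tokens appear in the chunk
```

A token-set measure like this is cheap but ignores word order and paraphrase, so a real pipeline might prefer span matching or an embedding-based check.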

The fine-tuned model was evaluated against a held-out split of 993 pairs using LLM-as-judge scoring on a 101-example stratified sample. The fine-tuned model scored a mean of 1.08/3.0 (16.8% fully correct; 101-example stratified eval) versus the Llama 3 8B base model at 0.85/3.0 (4% fully correct), a relative improvement of 27% in mean score; a zero-shot Claude Sonnet baseline scored 1.52/3.0 (25% fully correct).
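The 27% figure is a relative change in mean judge score, not a difference in percentage points. A minimal check using only the three mean scores quoted above:

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative change of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


# Mean scores on the 0-3 scale, as reported in the abstract.
fine_tuned = 1.08
base = 0.85
claude_zero_shot = 1.52

rel = relative_improvement(fine_tuned, base)
print(f"fine-tuned vs. base: {rel:.1%}")  # 27.1%

# Remaining gap to the zero-shot Claude Sonnet baseline, in absolute points.
gap = claude_zero_shot - fine_tuned
print(f"gap to Claude baseline: {gap:.2f} points on the 0-3 scale")
```

In absolute terms the gain over the base model is 0.23 points on a 3-point scale, and the fine-tuned model still trails the zero-shot Claude Sonnet baseline by 0.44 points.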

The fine-tuned model shows concentrated improvement in procedural subdomains (travel documents, adjustment of status, nonimmigrant visas) while remaining weak on complex legal reasoning and time-sensitive statistics. The full pipeline ran for approximately $29 in cloud compute. All artifacts -- dataset, model, code, and prompt templates -- are publicly released. The system is not a substitute for legal counsel and does not reflect regulatory changes after the corpus crawl date.

Read on arxiv.org
