Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

arXiv cs.AI·Yao Fehlis, Benjamin Bengfort, Zhangzhang Si, Vahid Eyorokon, Prema Roman, Patrick Deziel, Devon Slonaker, Steve Veldman, Ben Johnson, Joyce Rigelo, Michael Wharton, Steve Kramer

5/20/2026


Quick Answer

This paper presents a microservice architecture for operationalizing Document AI, focusing on OCR and LLM pipelines.

Quick Take

Batch profiling of the pipeline surfaced two findings: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency set by shared GPU-inference capacity rather than worker count. Both give practitioners concrete guidance for deploying document understanding systems at scale.

Key Points

Microservice architecture encapsulates classification, OCR, and LLM extraction pipelines.
Asynchronous processing optimizes IO-bound operations in the pipeline.
OCR dominates end-to-end latency over language-model parsing.
System concurrency is limited by shared GPU-inference capacity.
The paper offers concrete architectural patterns for building document understanding systems that work beyond the benchmark.
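
The asynchronous-processing and shared-GPU points can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sleep-based stand-ins for IO and inference, and the use of a semaphore to model GPU capacity are all assumptions made for illustration.

```python
import asyncio


async def process_document(doc_id, gpu_slots, stats):
    # IO-bound step (hypothetical: fetching the file); the event loop is free meanwhile.
    await asyncio.sleep(0.001)
    # GPU-bound step: every worker competes for the same limited inference slots.
    async with gpu_slots:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.005)  # stand-in for an OCR or LLM inference call
        stats["in_flight"] -= 1
    return doc_id


async def worker(queue, gpu_slots, stats, done):
    # Each worker drains the shared queue until no documents remain.
    while True:
        try:
            doc_id = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        done.append(await process_document(doc_id, gpu_slots, stats))


async def run_pipeline(num_docs, num_workers, gpu_capacity):
    queue = asyncio.Queue()
    for doc_id in range(num_docs):
        queue.put_nowait(doc_id)
    gpu_slots = asyncio.Semaphore(gpu_capacity)  # models shared GPU capacity
    stats = {"in_flight": 0, "peak": 0}
    done = []
    workers = [worker(queue, gpu_slots, stats, done) for _ in range(num_workers)]
    await asyncio.gather(*workers)
    return sorted(done), stats["peak"]


def simulate(num_docs, num_workers, gpu_capacity):
    # Returns the processed document ids and the peak number of concurrent GPU calls.
    return asyncio.run(run_pipeline(num_docs, num_workers, gpu_capacity))
```

Calling `simulate(12, 8, 2)` starts eight workers, yet the peak number of concurrent inference calls never exceeds the two modeled GPU slots. In this toy model, adding workers beyond that point only lengthens the wait for a slot, which mirrors the saturation behavior the key points describe.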


[Submitted on 12 May 2026]

Authors: Yao Fehlis, Benjamin Bengfort, Zhangzhang Si, Vahid Eyorokon, Prema Roman, Patrick Deziel, Devon Slonaker, Steve Veldman, Ben Johnson, Joyce Rigelo, Michael Wharton, Steve Kramer


Abstract: Academic research tends to focus on new models for document understanding, leaving a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model (LLM) structured field extraction, and we report our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark, effectively operationalizing models in production.
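
The abstract attributes its latency finding to batch profiling without describing the tooling. As a rough sketch of how per-stage timing could be collected, assuming a hypothetical `StageProfiler` helper and stage names that do not come from the paper:

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per pipeline stage across a batch."""

    def __init__(self):
        self.totals = defaultdict(float)  # stage name -> total seconds

    @contextmanager
    def stage(self, name):
        # Time the enclosed block and add it to the stage's running total,
        # even if the block raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self):
        # Fraction of total measured time spent in each stage.
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}

    def dominant(self):
        # Stage with the largest accumulated time, or None if nothing was timed.
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)
```

Wrapping each call in a block such as `with profiler.stage("ocr"):` and reading `profiler.shares()` after a batch shows which stage consumes the most time. Wall-clock totals of this kind include time spent waiting on shared resources, so they are best read alongside a concurrency measurement.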

Subjects:	Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as:	arXiv:2605.18818 [cs.AI]
	(or arXiv:2605.18818v1 [cs.AI] for this version)
	https://doi.org/10.48550/arXiv.2605.18818 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Yao Fehlis
[v1] Tue, 12 May 2026 13:07:34 UTC (20 KB)

— Originally published at arxiv.org
