Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production
Quick Answer
This paper presents a microservice architecture for operationalizing Document AI, focusing on OCR and LLM pipelines.
Quick Take
Batch profiling of the pipeline shows that OCR, not language-model parsing, dominates end-to-end latency, and that the system saturates at a concurrency set by shared GPU-inference capacity rather than by worker count. The paper turns these findings into architectural guidance for practitioners deploying document understanding systems at scale.
Key Points
- Microservice architecture encapsulates classification, OCR, and LLM extraction pipelines.
- Asynchronous processing optimizes IO-bound operations in the pipeline.
- OCR dominates end-to-end latency over language-model parsing.
- System concurrency is limited by shared GPU-inference capacity.
- Provides architectural patterns for effective document understanding systems.
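The concurrency point above can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: `GPU_SLOTS`, `OCR_SECONDS`, `IO_SECONDS`, `process_document`, and `run_batch` are hypothetical names and placeholder values, and `asyncio.sleep` stands in for real IO and inference calls. Asynchronous workers overlap their IO-bound steps freely, but every worker must pass through a semaphore that models the shared GPU-inference service.

```python
import asyncio
import time

# All values below are illustrative assumptions, not measurements from the paper.
GPU_SLOTS = 2        # concurrent requests the shared GPU service accepts
OCR_SECONDS = 0.05   # stand-in latency for one GPU-bound OCR call
IO_SECONDS = 0.01    # stand-in latency for an IO-bound step (fetch or store)


async def process_document(gpu: asyncio.Semaphore, stats: dict) -> None:
    """Simulate one document: IO-bound fetch, GPU-bound OCR, IO-bound store."""
    await asyncio.sleep(IO_SECONDS)              # fetch (overlaps across workers)
    async with gpu:                              # wait for shared GPU capacity
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(OCR_SECONDS)         # OCR inference
        stats["in_flight"] -= 1
    await asyncio.sleep(IO_SECONDS)              # store result


async def run_batch(num_docs: int, num_workers: int) -> tuple[float, int]:
    """Process num_docs with num_workers; return (elapsed seconds, peak GPU use)."""
    gpu = asyncio.Semaphore(GPU_SLOTS)
    stats = {"in_flight": 0, "peak": 0}
    queue: asyncio.Queue = asyncio.Queue()
    for doc_id in range(num_docs):
        queue.put_nowait(doc_id)

    async def worker() -> None:
        # Each worker drains the queue until no documents remain.
        while True:
            try:
                queue.get_nowait()
            except asyncio.QueueEmpty:
                return
            await process_document(gpu, stats)

    start = time.perf_counter()
    await asyncio.gather(*(worker() for _ in range(num_workers)))
    return time.perf_counter() - start, stats["peak"]


if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        elapsed, peak = asyncio.run(run_batch(num_docs=8, num_workers=workers))
        print(f"workers={workers} elapsed={elapsed:.2f}s peak_gpu_in_flight={peak}")
```

In this toy model, adding workers beyond the number of GPU slots cannot push batch time below `num_docs / GPU_SLOTS * OCR_SECONDS`, which is the kind of saturation the key points describe.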
Paper Resources
Authors: Yao Fehlis, Benjamin Bengfort, Zhangzhang Si, Vahid Eyorokon, Prema Roman, Patrick Deziel, Devon Slonaker, Steve Veldman, Ben Johnson, Joyce Rigelo, Michael Wharton, Steve Kramer
Abstract: Academic research tends to focus on new models for document understanding creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction as well as our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including a hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark; effectively operationalizing models in production.
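The abstract attributes its latency finding to batch profiling but this summary does not say how the profiling was done. As one hedged illustration of the general idea, the sketch below accumulates wall-clock time per pipeline stage over a batch and reports each stage's share of the total. `StageProfiler`, `profile_batch`, and the stage names are hypothetical and chosen only for this example; they are not taken from the paper.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulate wall-clock seconds per named pipeline stage (illustrative only)."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name: str):
        # Time whatever runs inside the `with` block and add it to the stage total.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> dict:
        """Return each stage's fraction of the total recorded time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: secs / total for name, secs in self.totals.items()}

    def dominant(self):
        """Return the stage with the largest total, or None if nothing was recorded."""
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)


def profile_batch(documents, stages, profiler: StageProfiler) -> list:
    """Run every (name, function) stage over each document, timing each stage."""
    results = []
    for doc in documents:
        payload = doc
        for name, fn in stages:
            with profiler.stage(name):
                payload = fn(payload)
        results.append(payload)
    return results
```

A caller would pass its own stage functions, for example `[("classify", classify), ("ocr", run_ocr), ("extract", extract_fields)]` (all hypothetical), and then inspect `profiler.shares()` to see which stage accounts for most of the batch time.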
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Software Engineering (cs.SE)
Cite as: arXiv:2605.18818 [cs.AI] (or arXiv:2605.18818v1 [cs.AI] for this version)
DOI: https://doi.org/10.48550/arXiv.2605.18818 (arXiv-issued DOI via DataCite, pending registration)
Submission history
From: Yao Fehlis
[v1] Tue, 12 May 2026 13:07:34 UTC (20 KB)
Originally published at arxiv.org