
Baidu's "Unlimited OCR" processes dozens of document pages in one pass by treating memory like human forgetting
Quick Answer
Baidu's Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to keep its KV cache at a fixed size while processing multi-page documents in a single pass. It scores 93.92% on OmniDocBench v1.6.
Quick Take
In Base mode, Unlimited OCR generates 5,580 tokens per second, 12.7% more than the 4,951 of Deepseek OCR, the model it builds on. Baidu plans to train models with a 128,000-token context length.
Key Points
- Unlimited OCR uses R-SWA, which caps the KV cache at the prefix length plus a 128-token window.
- Scores 93.92% on OmniDocBench v1.6, topping the end-to-end system rankings.
- Keeps the KV cache constant, reaching 5,580 tokens per second in Base mode.
- Trained on about 2 million document samples, split 9-to-1 between single-page and multi-page data.
- Baidu plans to train 128,000-token models so the system can take in more pages.
Current end-to-end OCR systems use a language model as their decoder, so the key-value (KV) cache that stores previously processed tokens grows with every new line of text. That drives up memory use and steadily slows generation. In practice, systems get around the problem with a loop that processes each document page by page, resetting the cache after every step.

Baidu frames the problem with a human analogy. Someone copying a book doesn't re-read everything they've already written. They keep their eyes on the source, the last few characters they wrote, and the next one to put down. Older passages fade through a kind of soft forgetting. The researchers want Unlimited OCR to mimic that pattern.
A fixed window caps memory use
It works through what the team calls Reference Sliding Window Attention (R-SWA). Each generated token still sees all reference tokens, meaning the visual image tokens and the prompt. For previously generated output, though, it looks back only at the last 128 tokens. That keeps the KV cache at a constant size throughout the entire process instead of growing linearly with output length.
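The attention pattern described above can be sketched as a simple visibility rule. This is an illustration, not Baidu's implementation: the function name `rswa_visible` is hypothetical, and it assumes the 128-token window counts the current token itself and that reference tokens occupy the first `prefix_len` positions.

```python
def rswa_visible(query_pos, prefix_len, window=128):
    """Positions a generated token may attend to under the assumed R-SWA rule.

    Positions [0, prefix_len) are reference tokens (image tokens and prompt).
    Positions >= prefix_len are generated output tokens.
    """
    if query_pos < prefix_len:
        raise ValueError("query_pos must index a generated token")
    # Every reference token stays visible, no matter how long the output gets.
    reference = list(range(prefix_len))
    # Only the most recent `window` generated tokens are visible
    # (assumption: the window includes the current token).
    first_recent = max(prefix_len, query_pos - window + 1)
    recent = list(range(first_recent, query_pos + 1))
    return reference + recent


# One page at 256 visual tokens, then 1,001 generated tokens.
visible = rswa_visible(query_pos=1256, prefix_len=256)
print(len(visible))  # 384 = 256 reference tokens + 128 recent tokens
```

However far generation runs, the number of visible positions never exceeds `prefix_len + window`.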

Standard sliding window attention would also subject visual tokens to ongoing state changes, gradually blurring image features and degrading recognition. R-SWA exempts visual tokens from these transitions. They're encoded once and stay unchanged.
The KV cache works as a queue where each new token pushes out the oldest one. With standard multi-head attention, memory use grows without bound as token count rises. R-SWA caps it at the fixed sum of prefix length and window size.
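The queue behavior can be mimicked with a bounded deque. This is a minimal sketch under stated assumptions: the class `RSWACache` is hypothetical, cache entries are stand-in tuples rather than real key/value tensors, and the 256-entry prefix corresponds to a single page with no prompt.

```python
from collections import deque


class RSWACache:
    """Toy cache: a fixed reference prefix plus a FIFO window of recent tokens."""

    def __init__(self, prefix_entries, window=128):
        self.prefix = tuple(prefix_entries)  # encoded once, never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops out when full

    def append(self, entry):
        self.recent.append(entry)

    def __len__(self):
        return len(self.prefix) + len(self.recent)


cache = RSWACache(prefix_entries=range(256), window=128)
sizes = []
for step in range(1000):
    cache.append(("kv", step))
    sizes.append(len(cache))

print(max(sizes))       # 384: prefix length (256) + window size (128)
print(cache.recent[0])  # ('kv', 872): the oldest entry still held
```

A cache without the window would hold 1,256 entries after the same 1,000 steps and keep growing.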

Built on top of Deepseek OCR
Unlimited OCR builds on the open-source Deepseek OCR model. Baidu keeps its DeepEncoder and pairs it with a mixture-of-experts architecture with three billion parameters, of which only about 500 million are active during inference. The DeepEncoder compresses a 1024-by-1024-pixel PDF image down to 256 tokens.
Two resolution modes carry over. "Base" mode handles multi-page documents, and "Gundam" mode uses dynamic resolution for single pages. Every standard attention layer in the decoder was swapped out for R-SWA.
Training used about two million document samples, split 9-to-1 between single-page and multi-page data. Paddle OCR handled annotation for single pages. Multi-page data was built synthetically by stitching single pages together into documents ranging from two to 50 pages.
All data was packed into sequences of 32,000 tokens. Training ran for 4,000 steps on 8×16 Nvidia A800 GPUs, 128 in total. The DeepEncoder stayed frozen, and only the language model parameters were updated.
Better scores despite limited attention
Unlimited OCR scores 93 percent overall on the OmniDocBench v1.5 document benchmark, six percentage points above the Deepseek OCR baseline, according to the authors. The benchmark measures several sub-tasks. Pure text recognition error rate, measured as edit distance (the number of corrections needed per character), drops slightly. Table structure recognition improves more sharply, by nearly six percentage points. On the newer v1.6 version, the model hits 93.92 percent, putting it at the top of the end-to-end system rankings.
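For readers unfamiliar with the edit distance metric, here is a minimal sketch of a normalized version. It assumes Levenshtein distance divided by the length of the longer string; the benchmark's exact normalization may differ, and the function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1]; 0 means the strings are identical."""
    longest = max(len(pred), len(ref))
    return levenshtein(pred, ref) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))  # 3
```

On this scale, a score of 0.11 means roughly one character in nine would need a correction.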
In the long-horizon test, where the model processes many pages in a single pass, the error rate stays below 0.11 even past 40 pages. The authors pin the remaining errors not on lost context but on the DeepEncoder's resolution limit in Base mode when text gets tiny.
Restricting the window to 128 tokens on single pages doesn't hurt accuracy. It actually helps slightly. The researchers suspect R-SWA forces the model to focus more tightly on the dense OCR task, while full attention tends toward divergence as output length grows.
The constant cache also pays off in speed. In Base mode, Unlimited OCR hits 5,580 tokens per second versus 4,951 for Deepseek OCR, a 12.7 percent bump. In a theoretical comparison of upper bounds with ideal parallelism, the model leads the baseline by 35 percent at around 6,000 output tokens, while the baseline's throughput drops steadily as length increases.
Long-document parsing is the model's core strength. Alongside an edit distance below 0.11, Baidu reports a Distinct-35 score of 97 percent even at 40-plus pages. The errors that did show up came mainly from tiny text, which the researchers trace to Base mode's limited resolution rather than to any context lost through R-SWA.
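The article does not define Distinct-35. Assuming it follows the usual distinct-n convention, the share of n-grams in the output that are unique with n set to 35, it can be sketched as below. Whether Baidu counts tokens, words, or characters is not stated, and the function name is illustrative.

```python
def distinct_n(tokens, n=35):
    """Fraction of n-grams in `tokens` that are unique (1.0 = no repeated n-gram)."""
    total = len(tokens) - n + 1
    if total <= 0:
        # Convention chosen for this sketch: a sequence too short to
        # contain any n-gram counts as free of repetition.
        return 1.0
    unique = {tuple(tokens[i:i + n]) for i in range(total)}
    return len(unique) / total


# A looping output repeats the same n-grams and scores low.
print(distinct_n(list("abcabcabc"), n=3))  # 3 unique of 7 trigrams
```

Under this reading, a high score at 40-plus pages indicates the model rarely falls into repetition loops on long outputs.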
Not truly unlimited yet
The model's fixed context length of 32,000 tokens limits how many pages it can take in, since visual tokens stack up with each additional page. Baidu plans to train 128,000-token models soon and eventually build a prefill pool that lets the model fetch relevant KV blocks on its own, like flipping through a book. The authors also see R-SWA as transferable to other reference-based tasks like speech recognition and translation.
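A back-of-the-envelope calculation shows how the context length bounds the page count. It assumes each page costs the 256 visual tokens quoted earlier for a 1024-by-1024 image, and it treats prompt and output tokens as an optional `reserved` amount, a hypothetical parameter. The result is an upper bound on visual tokens alone, not a figure from Baidu.

```python
def max_pages(context_len=32_000, tokens_per_page=256, reserved=0):
    """Upper bound on pages whose visual tokens fit in the context window."""
    return (context_len - reserved) // tokens_per_page


print(max_pages())                    # 125 pages at 32,000 tokens
print(max_pages(context_len=128_000)) # 500 pages at 128,000 tokens
print(max_pages(reserved=128))        # 124 pages if 128 tokens are set aside
```

Under these assumptions, the planned 128,000-token models would quadruple the ceiling.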
Code and model weights are on GitHub and Hugging Face. The model runs on ModelScope and the inference engines vLLM and SGLang. You can try it in a demo on Hugging Face Spaces.
OCR has become one of AI's more active battlegrounds, with models competing mainly on token efficiency. The interest goes well beyond document recognition. Since image-based text uses far less compute than its digital equivalent, the method could expand language model memory for long chat histories or large documents. Developers already use this to cut token costs on Anthropic's Fable 5.
Deepseek pushed in this direction earlier this year with Deepseek OCR 2, an encoder that rearranges image information semantically instead of reading rigidly from top-left to bottom-right. It scores 91.09 percent on OmniDocBench v1.5.
Mistral AI is building out its position with Mistral OCR 3, touting better recognition of handwriting, forms, and complex tables. For Baidu, this work fits into a broader AI push. The company recently shipped Ernie 5.1, a multimodal model that ranked as the top Chinese model on LMArena.
Quickly scannable books are also attractive as training data for new language models, a topic sparking heated debate. Researchers have shown that large language models can reproduce near-verbatim passages from copyrighted books like "Harry Potter" and "The Hobbit."
— Originally published at the-decoder.com

