Baidu's "Unlimited OCR" processes dozens of document pages in one pass by treating memory like human forgetting

The Decoder·Jonathan Kemper

7/5/2026

Quick Take

Baidu's Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to keep the KV cache at a fixed size during multi-page document processing, and it scores 93.92% on OmniDocBench v1.6. In Base mode it decodes 5,580 tokens per second, 12.7% faster than Deepseek OCR, the open-source model it builds on. Baidu plans to extend the context length from 32,000 to 128,000 tokens.

Key Points

Unlimited OCR uses R-SWA to cap memory at the reference prefix plus a 128-token window of recent output.
Scores 93.92% on OmniDocBench v1.6, the top result among end-to-end systems.
Keeps the KV cache constant, reaching 5,580 tokens per second versus 4,951 for Deepseek OCR.
Trained on about 2 million document samples, split 9-to-1 between single-page and multi-page data.
Baidu plans to extend the context length from 32,000 to 128,000 tokens.

Current end-to-end OCR systems use a language model as their decoder, so the KV cache, the buffer that holds attention keys and values for every earlier token, grows with each token the model generates. That drives up memory use and steadily slows generation. In practice, systems get around the problem with a loop that processes each document page by page, resetting the cache after every page.
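
The page-by-page workaround can be sketched as a plain loop. This is an illustration only: `parse_page_by_page` and the `decode_page` callback are hypothetical names, not part of any real OCR library, and the cache is modeled as a simple list.

```python
from typing import Callable, Sequence


def parse_page_by_page(
    pages: Sequence[str],
    decode_page: Callable[[str, list], str],
) -> list[str]:
    """Decode each page on its own, starting from an empty cache every time.

    `decode_page` is a hypothetical stand-in for a real decoder call.
    """
    outputs = []
    for page in pages:
        kv_cache: list = []  # reset: nothing carries over from earlier pages
        outputs.append(decode_page(page, kv_cache))
    return outputs
```

The cache never grows beyond one page's worth of output, but the decoder also never sees anything from the pages before it.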

Diagram of Unlimited OCR's architecture. On the left, a person copying a book, labeled "focus the books," "working memory," and "forgetting (soft)." On the right, the architecture showing the DeepEncoder and MoE decoder LLM-(R-SWA), with the KV cache organized as a queue of visual and prompt tokens. — Baidu derives its architecture from how humans copy text by hand. The DeepEncoder compresses pages while the MoE decoder processes them with R-SWA, running the KV cache as a fixed-length queue. | Image: Baidu

Baidu frames the problem with a human analogy. Someone copying a book doesn't re-read everything they've already written. They keep their eyes on the source, the last few characters they wrote, and the next one to put down. Older passages fade through a kind of soft forgetting. The researchers want Unlimited OCR to mimic that pattern.

A fixed window caps memory use

It works through what the team calls Reference Sliding Window Attention (R-SWA). Each generated token still sees all reference tokens, the visual image tokens and the prompt. But when it comes to previously generated output, it only looks back at the last 128 tokens. That keeps the KV cache constant throughout the entire process instead of growing linearly with output length.
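
The attention pattern described above can be written down as a boolean mask. The following is a minimal sketch, not Baidu's implementation: the function name `rswa_mask` is made up for illustration, and it assumes the 128-token window includes the token currently being generated.

```python
def rswa_mask(num_ref: int, num_out: int, window: int = 128) -> list[list[bool]]:
    """Return mask[i][j] == True if output token i may attend to position j.

    Positions 0..num_ref-1 are reference tokens (visual tokens and prompt);
    positions from num_ref onward are generated output tokens.
    """
    total = num_ref + num_out
    mask = []
    for i in range(num_out):
        q = num_ref + i  # absolute position of the i-th output token
        row = [False] * total
        for j in range(num_ref):
            row[j] = True  # reference tokens are always visible
        # Assumption: the window counts the current token plus the
        # (window - 1) output tokens generated just before it.
        start = max(num_ref, q - window + 1)
        for j in range(start, q + 1):
            row[j] = True
        mask.append(row)
    return mask


# Number of positions each output token attends to, for a toy setup
# with 4 reference tokens, 6 output tokens, and a window of 2.
attended = [sum(row) for row in rswa_mask(4, 6, window=2)]
```

In the toy setup, the per-token count stops growing once the window is full, which is the property that keeps the cache constant.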

Two side-by-side attention matrices titled "Vanilla Attention" and "R-SWA." Color-coded cells distinguish reference tokens, working memory, and unattended positions, showing that R-SWA maintains a constant KV cache unlike full attention. — With R-SWA, each generated token attends to all reference tokens but only the last n output tokens, keeping the KV cache constant throughout decoding. | Image: Baidu

Standard sliding window attention would also subject visual tokens to ongoing state changes, gradually blurring image features and degrading recognition. R-SWA exempts visual tokens from these transitions. They're encoded once and stay unchanged.

The KV cache works as a queue where each new token pushes out the oldest one. With standard multi-head attention, memory use grows without bound as token count rises. R-SWA caps it at the fixed sum of prefix length and window size.
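
A bounded queue makes the memory cap easy to see. This sketch models cache entries as plain Python objects rather than key/value tensors, and the class name `RefWindowCache` is hypothetical.

```python
from collections import deque


class RefWindowCache:
    """Toy model of a cache with a permanent prefix and a FIFO window."""

    def __init__(self, prefix_entries, window: int = 128):
        self.prefix = list(prefix_entries)  # reference entries, never evicted
        self.recent = deque(maxlen=window)  # generated entries, oldest dropped first

    def append(self, entry) -> None:
        # A deque with maxlen discards its oldest item once it is full.
        self.recent.append(entry)

    def __len__(self) -> int:
        return len(self.prefix) + len(self.recent)


cache = RefWindowCache(prefix_entries=range(256), window=128)
for step in range(6000):
    cache.append(step)
# len(cache) is now 256 + 128 = 384, no matter how many steps ran.
```

A full-attention cache in the same scenario would hold 256 + 6,000 entries and keep growing.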

Line chart of per-call Flash Attention v3 kernel latency across decode steps from 0 to 6,000. The DeepSeek OCR curve (Ds-Attn) climbs past 16 microseconds, while the Unlimited OCR curve (UoW-Attn) stays flat at around 9 microseconds. — Deepseek OCR's kernel latency climbs with each decoding step, while Unlimited OCR stays flat thanks to R-SWA. | Image: Baidu

Built on top of Deepseek OCR

Unlimited OCR builds on the open-source Deepseek OCR model. Baidu keeps its DeepEncoder and pairs it with a mixture-of-experts architecture with three billion parameters, of which only about 500 million are active during inference. The DeepEncoder compresses a 1024-by-1024-pixel PDF image down to 256 tokens.
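
A quick back-of-the-envelope check of those figures, using only the numbers quoted above (the helper names are illustrative):

```python
import math


def pixels_per_token(width: int, height: int, tokens: int) -> float:
    """Average number of input pixels summarized by one visual token."""
    return (width * height) / tokens


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of parameters used per token in a mixture-of-experts model."""
    return active_params / total_params


ppt = pixels_per_token(1024, 1024, 256)  # 4096.0 pixels per token
side = math.isqrt(int(ppt))              # 64, i.e. the area of a 64x64 patch
share = active_fraction(500e6, 3e9)      # about one sixth of the parameters
```

Each visual token therefore stands in for roughly a 64-by-64-pixel area, and about one sixth of the decoder's parameters are active per token.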

Two resolution modes carry over. "Base" mode handles multi-page documents, and "Gundam" mode uses dynamic resolution for single pages. Every standard attention layer in the decoder was swapped out for R-SWA.

Training used about two million document samples, split 9-to-1 between single-page and multi-page data. Paddle OCR handled annotation for single pages. Multi-page data was built synthetically by stitching single pages together into documents ranging from two to 50 pages.
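
The stitching step might look roughly like the sketch below. The article does not describe Baidu's actual pipeline, so everything here is an assumption: the sample layout (`image` and `text` keys), the `<page_break>` separator, and the function name are all invented for illustration.

```python
import random


def stitch_document(pages, rng, min_pages=2, max_pages=50,
                    sep="\n<page_break>\n"):
    """Combine randomly chosen single-page samples into one multi-page sample.

    `pages` is a list of dicts with hypothetical keys "image" and "text".
    The separator string is an assumption, not a documented token.
    """
    if len(pages) < min_pages:
        raise ValueError("not enough single-page samples to stitch")
    n = rng.randint(min_pages, min(max_pages, len(pages)))
    chosen = rng.sample(pages, n)
    return {
        "images": [p["image"] for p in chosen],
        "text": sep.join(p["text"] for p in chosen),
    }


single_pages = [{"image": f"img{i}", "text": f"text {i}"} for i in range(60)]
doc = stitch_document(single_pages, random.Random(0))
```

Passing in a seeded `random.Random` keeps the synthetic documents reproducible.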

All data was packed into sequences of 32,000 tokens, and training ran for 4,000 steps on 128 Nvidia A800 GPUs (8 × 16). The DeepEncoder stayed frozen, and only the language model parameters were updated.

Better scores despite limited attention

Unlimited OCR scores 93 percent overall on the OmniDocBench v1.5 document benchmark, six percentage points above the Deepseek OCR baseline, according to the authors. The benchmark measures several sub-tasks. Pure text recognition error rate, measured as edit distance (the number of corrections needed per character), drops slightly. Table structure recognition improves more sharply, by nearly six percentage points. On the newer v1.6 version, the model hits 93.92 percent, putting it at the top of the end-to-end system rankings.
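
For readers unfamiliar with the metric, here is a minimal sketch of a normalized edit distance. It uses the classic Levenshtein distance and divides by the longer string's length, which is one common normalization; the benchmark's exact formula may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0 means the strings are identical."""
    if not pred and not ref:
        return 0.0
    # Assumption: normalize by the longer of the two strings.
    return levenshtein(pred, ref) / max(len(pred), len(ref))
```

Under this definition, a score of 0.11 would mean roughly 11 character-level corrections per 100 characters.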

In the long-horizon test, where the model processes many pages in a single pass, the error rate stays below 0.11 even past 40 pages. The authors pin the remaining errors not on lost context but on the DeepEncoder's resolution limit in Base mode when text gets tiny.

Restricting the window to 128 tokens on single pages doesn't hurt accuracy. It actually helps slightly. The researchers suspect R-SWA forces the model to focus more tightly on the dense OCR task, while full attention tends toward divergence as output length grows.

The constant cache also pays off in speed. In Base mode, Unlimited OCR hits 5,580 tokens per second versus 4,951 for Deepseek OCR, a 12.7 percent bump. In a theoretical comparison of upper bounds with ideal parallelism, the model leads the baseline by 35 percent at around 6,000 output tokens, while the baseline's throughput drops steadily as length increases.
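
The 12.7 percent figure follows directly from the two throughput numbers, as this small check shows (the function name is illustrative):

```python
def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, as a fraction."""
    return (new - baseline) / baseline


gain = relative_gain(5580, 4951)
gain_percent = round(gain * 100, 1)  # 12.7
```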

For long document parsing, the model's core strength, Baidu also reports a Distinct-35 score of 97 percent at 40-plus pages alongside the sub-0.11 edit distance, which points to little repeated output even on long documents. The researchers trace the errors that do appear to tiny text and Base mode's limited resolution, not to R-SWA losing its place in the document.
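
Distinct-n scores are usually computed as the share of n-grams in the output that are unique. The sketch below follows that common definition; whether Baidu counts tokens or characters, and how it handles edge cases, is not stated in the article.

```python
def distinct_n(tokens, n: int) -> float:
    """Fraction of n-grams in `tokens` that are unique (1.0 = no repetition)."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 1.0  # too short to contain any n-gram, so nothing repeats
    ngrams = {tuple(tokens[i:i + n]) for i in range(total)}
    return len(ngrams) / total


looping = distinct_n(list("abcabc"), 2)  # bigrams ab, bc, ca, ab, bc
clean = distinct_n(list("abcdef"), 2)
```

With n set to 35, a model stuck repeating the same passage would score far below 1.0, while output that never repeats a 35-gram scores exactly 1.0.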

Not truly unlimited yet

The model's fixed context length of 32,000 tokens limits how many pages it can take in, since visual tokens stack up with each additional page. Baidu plans to train 128,000-token models soon and eventually build a prefill pool that lets the model fetch relevant KV blocks on its own, like flipping through a book. The authors also see R-SWA as transferable to other reference-based tasks like speech recognition and translation.
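
A rough upper bound on page count can be estimated from the numbers in the article. This sketch assumes 256 visual tokens per page and that only the prompt and the 128-token output window share the context with them. Both are simplifications, so real limits may well be lower.

```python
def max_pages(context_tokens: int, tokens_per_page: int = 256,
              prompt_tokens: int = 0, window: int = 128) -> int:
    """Upper bound on pages whose visual tokens fit in the context.

    Assumes only the prompt and the sliding output window compete
    with visual tokens for space (a simplifying assumption).
    """
    available = context_tokens - prompt_tokens - window
    return max(0, available // tokens_per_page)


pages_32k = max_pages(32_000)    # 124
pages_128k = max_pages(128_000)  # 499
```

Under these assumptions, the planned 128,000-token models would roughly quadruple the page budget.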

Code and model weights are on GitHub and Hugging Face. The model runs on ModelScope and the inference engines vLLM and SGLang. You can try it in a demo on Hugging Face Spaces.

OCR has become one of AI's more active battlegrounds, with models competing mainly on token efficiency. The interest goes well beyond document recognition. Since text rendered as an image can be encoded in far fewer tokens than the same text fed in directly, the method could expand language model memory for long chat histories or large documents. Developers already use this to cut token costs on Anthropic's Fable 5.

Deepseek pushed in this direction earlier this year with Deepseek OCR 2, an encoder that rearranges image information semantically instead of reading rigidly from top-left to bottom-right. It scores 91.09 percent on OmniDocBench v1.5.

Mistral AI is building out its position with Mistral OCR 3, touting better recognition of handwriting, forms, and complex tables. For Baidu, this work fits into a broader AI push. The company recently shipped Ernie 5.1, a multimodal model that ranked as the top Chinese model on LMArena.

Quickly scannable books are also attractive as training data for new language models, a topic sparking heated debate. Researchers have shown that large language models can reproduce near-verbatim passages from copyrighted books like "Harry Potter" and "The Hobbit."

— Originally published at the-decoder.com
