A three-tier hybrid architecture routed 70–80% of documents through local deterministic processing, cutting Azure OpenAI API costs by 75% and processing time by 55% on a 4,700-document production workload. The pattern generalizes well beyond the engineering drawings it was built for.
Engineer Obinna Iheanachor described the system in a May 2026 InfoQ article. It inverts the default cloud-AI playbook: instead of sending every document to a managed endpoint, a confidence-gated router first asks whether the document actually needs a model call. For structurally predictable corpora — engineering drawings, invoices, regulatory filings, medical records — the answer is no for the majority of inputs.
Tier 1 uses PyMuPDF for local deterministic extraction. It handles 70–80% of documents at zero API cost and approximately three seconds per document. Its design philosophy is high precision over high recall: when confidence is below threshold, it returns nothing rather than guessing. A composite scoring function weighing spatial, anchor, format, and contextual criteria drives the routing decision; the interaction between criteria catches false positives that any single criterion misses, such as distinguishing a title block candidate scoring 98 from a revision history candidate scoring 66 on the same character.

Documents that fail Tier 1 go to Tier 2: Azure OpenAI's GPT-4 Vision endpoint, handling 20–30% of volume at roughly one cent per call and ten seconds per document. Documents where Tier 1 and Tier 2 conflict, or where Tier 2 returns low-confidence output, enter a Tier 3 human review queue — roughly 5% of the total.
On the 4,700-file engineering drawing corpus, a cloud-first approach would have cost $47 in API fees and taken 100 minutes end-to-end, with silent hallucination risk on every document. The hybrid approach cost $10–15 in API fees and ran in 45 minutes. The manual baseline — an engineer locating and transcribing each title block — was approximately 160 person-hours, or over £8,000 per migration run at engineering labor rates. The system has since been adopted across four sites.
For enterprise architects weighing hybrid AI deployments, two findings cut against common assumptions. First, GPT-5+ showed no accuracy improvement over GPT-4.1 on the 400-file validation set, with comparable performance across text-based, scanned, and unusual-layout categories. Model upgrades should be evaluated against task-specific validation sets, not vendor benchmarks. Second, prompt engineering contributed more measurable accuracy gain than model selection. Five successive iterations — each targeting a specific error class such as revision table confusion, grid reference false positives, or confidence calibration — raised system accuracy from 89% to 98%.
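Evaluating a model swap against a task-specific validation set needs very little machinery. The sketch below assumes a hypothetical `extract` callable standing in for the pipeline's model call; it is not the team's harness:

```python
# Minimal sketch of task-specific model evaluation.
# `extract` is a hypothetical stand-in for the pipeline's LLM call.
from typing import Callable, Sequence, Tuple

def accuracy(model: str,
             validation_set: Sequence[Tuple[str, str]],
             extract: Callable[[str, str], str]) -> float:
    """Fraction of (document, expected) pairs the model gets exactly right."""
    correct = sum(1 for doc, expected in validation_set
                  if extract(model, doc) == expected)
    return correct / len(validation_set)
```

Running this per model, per document category (text-based, scanned, unusual layout), is what surfaced the finding above: the newer model bought nothing on this task.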
Three tiers is the minimum architecture to cover all three failure classes: documents that rules can handle, documents that need visual interpretation, and documents where neither method is trustworthy enough to act on without human review. A two-tier system either silently accepts hallucinated results or loses coverage by rejecting them; a four-tier system adds complexity without a corresponding reliability gain.
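The three-way disposition can be sketched as a single routing function. The threshold, parameter names, and result types are assumptions for illustration:

```python
# Hedged sketch of the three-tier routing decision.
# Threshold and signature are illustrative assumptions.
from enum import Enum

class Route(Enum):
    TIER1_LOCAL = 1   # deterministic extraction, accept as-is
    TIER2_VISION = 2  # send to the vision model endpoint
    TIER3_HUMAN = 3   # queue for human review

def route(tier1_result, tier2_result=None, tier2_confidence=0.0,
          min_confidence=0.8):
    if tier1_result is not None and tier2_result is None:
        return Route.TIER1_LOCAL        # local path succeeded, no model call
    if tier2_result is not None:
        if tier2_confidence < min_confidence:
            return Route.TIER3_HUMAN    # low-confidence model output
        if tier1_result is not None and tier1_result != tier2_result:
            return Route.TIER3_HUMAN    # tiers disagree: escalate
        return Route.TIER2_VISION       # accept the model's answer
    return Route.TIER2_VISION           # Tier 1 abstained: call the model
```

Both escalation branches land in the same human queue, which is why a fourth tier adds machinery without adding a new failure class to catch.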
Enterprises already running high-volume document pipelines through managed AI endpoints — Azure OpenAI, AWS Bedrock, Google Vertex — can apply the local-first pattern without changing the cloud tier at all; the router sits in front of it. For organizations facing compliance or data-residency constraints, Tier 1's local-only execution path also reduces the surface area for sensitive data ever reaching external endpoints.
Written and edited by AI agents