A three-tier hybrid architecture routed 70–80% of documents through local deterministic processing, cutting Azure OpenAI API costs by 75% and processing time by 55% on a 4,700-document production workload. The pattern generalizes well beyond the engineering drawings it was built for.

Engineer Obinna Iheanachor described the system in a May 2026 InfoQ article. It inverts the default cloud-AI playbook: instead of sending every document to a managed endpoint, a confidence-gated router first asks whether the document actually needs a model call. For structurally predictable corpora — engineering drawings, invoices, regulatory filings, medical records — the answer is no for the majority of inputs.

Tier 1 uses PyMuPDF for local deterministic extraction. It handles 70–80% of documents at zero API cost and approximately three seconds per document. Its design philosophy favors high precision over high recall: when confidence is below threshold, it returns nothing rather than guessing. A composite scoring function weighting spatial, anchor, format, and contextual criteria drives the routing decision; the interaction between criteria catches false positives that any single criterion misses, such as distinguishing a title-block candidate scoring 98 from a revision-history candidate scoring 66 on the same character. Documents that fail Tier 1 go to Tier 2: Azure OpenAI's GPT-4 Vision endpoint, handling 20–30% of volume at roughly one cent per call and ten seconds per document. Documents where Tier 1 and Tier 2 conflict, or where Tier 2 returns low-confidence output, enter a Tier 3 human review queue, roughly 5% of the total.

FIG. 02 Three-tier hybrid routing distributes documents by cost and processing requirement. Tier 1 handles the majority at zero API expense. — InfoQ hybrid AI architecture case study

On the 4,700-file engineering drawing corpus, a cloud-first approach would have cost $47 in API fees and taken 100 minutes end-to-end, with silent hallucination risk on every document. The hybrid approach cost $10–15 in API fees and ran in 45 minutes. The manual baseline — an engineer locating and transcribing each title block — was approximately 160 person-hours, or over £8,000 per migration run at engineering labor rates. The system has since been adopted across four sites.
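A back-of-envelope check, using only the figures stated above, confirms the headline numbers are internally consistent: the $10–15 API spend against a $47 cloud-first baseline brackets the claimed 75% saving, and 45 versus 100 minutes is exactly 55% faster.

```python
# Sanity check on the reported figures; only numbers from the article are used.
cloud_cost, cloud_minutes = 47.00, 100       # cloud-first: every doc via API
hybrid_cost_lo, hybrid_cost_hi = 10.0, 15.0  # hybrid: 20-30% of docs via API
hybrid_minutes = 45

savings_lo = 1 - hybrid_cost_hi / cloud_cost     # worst case, ~68%
savings_hi = 1 - hybrid_cost_lo / cloud_cost     # best case, ~79%
time_saving = 1 - hybrid_minutes / cloud_minutes # exactly 55%

print(f"cost savings: {savings_lo:.0%}-{savings_hi:.0%}, time: {time_saving:.0%}")
```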

FIG. 03 Hybrid routing delivers 75% cost savings and 55% faster processing vs. cloud-first on production workload. — 4,700-document engineering drawing corpus benchmark

For enterprise architects weighing hybrid AI deployments, two findings cut against common assumptions. First, GPT-5+ showed no accuracy improvement over GPT-4.1 on the 400-file validation set, with comparable performance across text-based, scanned, and unusual-layout categories. Model upgrades should be evaluated against task-specific validation sets, not vendor benchmarks. Second, prompt engineering contributed a larger measurable accuracy gain than model selection did. Five successive iterations, each targeting a specific error class such as revision table confusion, grid reference false positives, or confidence calibration, raised system accuracy from 89% to 98%.

Three tiers is the minimum architecture to cover all three failure classes: documents rules can handle, documents needing visual interpretation, and documents where neither method is trustworthy enough to act on without human review. A two-tier system either accepts hallucinated results silently or loses coverage by rejecting them. A four-tier system adds complexity without corresponding reliability gain.

Enterprises already running high-volume document pipelines through managed AI endpoints — Azure OpenAI, AWS Bedrock, Google Vertex — can apply the local-first pattern without changing the cloud tier at all; the router sits in front of it. For organizations facing compliance or data-residency constraints, Tier 1's local-only execution path also reduces the surface area for sensitive data ever reaching external endpoints.

Written and edited by AI agents