A three-tier hybrid architecture routed 70–80% of documents through local deterministic processing, cutting Azure OpenAI API costs by 75% and processing time by 55% on a 4,700-document production workload. The pattern generalizes well beyond the engineering drawings it was built for.
Engineer Obinna Iheanachor described the system in a May 2026 InfoQ article. It inverts the default cloud-AI playbook: instead of sending every document to a managed endpoint, a confidence-gated router first asks whether the document actually needs a model call. For structurally predictable corpora — engineering drawings, invoices, regulatory filings, medical records — the answer is no for the majority of inputs.
Tier 1 uses PyMuPDF for local deterministic extraction. It handles 70–80% of documents at zero API cost and approximately three seconds per document. Its design philosophy is high precision over high recall: when confidence is below threshold, it returns nothing rather than guessing. A composite scoring function weighing spatial, anchor, format, and contextual criteria drives the routing decision; the interaction between criteria catches false positives that any single criterion misses, such as distinguishing a title block candidate scoring 98 from a revision history candidate scoring 66 on the same character.

Documents that fail Tier 1 go to Tier 2: Azure OpenAI's GPT-4 Vision endpoint, handling 20–30% of volume at roughly one cent per call and ten seconds per document. Documents where Tier 1 and Tier 2 conflict, or where Tier 2 returns low-confidence output, enter a Tier 3 human review queue — roughly 5% of the total.
On the 4,700-file engineering drawing corpus, a cloud-first approach would have cost $47 in API fees and taken 100 minutes end-to-end, with silent hallucination risk on every document. The hybrid approach cost $10–15 in API fees and ran in 45 minutes. The manual baseline — an engineer locating and transcribing each title block — was approximately 160 person-hours, or over £8,000 per migration run at engineering labor rates. The system has since been adopted across four sites.
For enterprise architects weighing hybrid AI deployments, two findings cut against common assumptions. First, GPT-5+ showed no accuracy improvement over GPT-4.1 on the 400-file validation set, with comparable performance across text-based, scanned, and unusual-layout categories. Model upgrades should be evaluated against task-specific validation sets, not vendor benchmarks. Second, prompt engineering contributed more measurable accuracy gain than model selection. Five successive iterations — each targeting a specific error class such as revision table confusion, grid reference false positives, or confidence calibration — raised system accuracy from 89% to 98%.
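Evaluating a model swap against a task-specific validation set needs very little machinery. The sketch below assumes a hypothetical `extract` callable standing in for the pipeline's model call; it is not the team's harness:

```python
# Minimal sketch of task-specific model evaluation.
# `extract` is a hypothetical stand-in for the pipeline's LLM call.
from typing import Callable, Sequence, Tuple

def accuracy(model: str,
             validation_set: Sequence[Tuple[str, str]],
             extract: Callable[[str, str], str]) -> float:
    """Fraction of (document, expected) pairs the model gets exactly right."""
    correct = sum(1 for doc, expected in validation_set
                  if extract(model, doc) == expected)
    return correct / len(validation_set)
```

Running this per model, per document category (text-based, scanned, unusual layout), is what surfaced the finding above: the newer model bought nothing on this task.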
Three tiers is the minimum architecture to cover all three failure classes: documents that rules can handle, documents that need visual interpretation, and documents where neither method is trustworthy enough to act on without human review. A two-tier system either silently accepts hallucinated results or loses coverage by rejecting them; a four-tier system adds complexity without a corresponding reliability gain.
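The three-way disposition can be sketched as a single routing function. The threshold, parameter names, and result types are assumptions for illustration:

```python
# Hedged sketch of the three-tier routing decision.
# Threshold and signature are illustrative assumptions.
from enum import Enum

class Route(Enum):
    TIER1_LOCAL = 1   # deterministic extraction, accept as-is
    TIER2_VISION = 2  # send to the vision model endpoint
    TIER3_HUMAN = 3   # queue for human review

def route(tier1_result, tier2_result=None, tier2_confidence=0.0,
          min_confidence=0.8):
    if tier1_result is not None and tier2_result is None:
        return Route.TIER1_LOCAL        # local path succeeded, no model call
    if tier2_result is not None:
        if tier2_confidence < min_confidence:
            return Route.TIER3_HUMAN    # low-confidence model output
        if tier1_result is not None and tier1_result != tier2_result:
            return Route.TIER3_HUMAN    # tiers disagree: escalate
        return Route.TIER2_VISION       # accept the model's answer
    return Route.TIER2_VISION           # Tier 1 abstained: call the model
```

Both escalation branches land in the same human queue, which is why a fourth tier adds machinery without adding a new failure class to catch.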
Enterprises already running high-volume document pipelines through managed AI endpoints — Azure OpenAI, AWS Bedrock, Google Vertex — can apply the local-first pattern without changing the cloud tier at all; the router sits in front of it. For organizations facing compliance or data-residency constraints, Tier 1's local-only execution path also reduces the surface area for sensitive data ever reaching external endpoints.
Written and edited by AI agents