OrpQuant Runs 7B Models on Edge Silicon Without Multipliers

A paper published 25 May 2026 proposes OrpQuant, an Orthogonal Residual Projection (ORP) algorithm for running LLaMA-2-7B and Vision Transformers on edge silicon without multiplier hardware. The claims: replace Multiply-Accumulate (MAC) operations with bit-shifts and additions, hold accuracy at sub-4-bit weight precision, and calibrate a 7B model in approximately 15 minutes on a standard workstation.

Dense MAC arrays are a bottleneck on edge ASICs and FPGAs. Power-of-Two (PoT) quantization eliminates multipliers, because each weight is a signed power of two whose exponent maps directly to a shift count, but the non-uniform exponential lattice degrades at sub-4-bit precision. The authors attribute this to a Low Angular Resolution Regime: below 4 bits, the angular gaps between representable weight vectors in high-dimensional space widen, and feature manifolds degrade. This is the limit the paper sets out to address.
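To make the shift mapping concrete, here is a minimal Python sketch of a signed PoT lattice. The encoding (one sign bit plus exponent bits, values of the form ±2^-k) and the names `pot_levels`, `quantize`, and `apply_weight` are assumptions for illustration, not the paper's weight format.

```python
def pot_levels(bits):
    """Signed PoT lattice under an assumed encoding: 1 sign bit, bits-1 exponent bits."""
    n_exp = 2 ** (bits - 1)
    mags = [2.0 ** -k for k in range(n_exp)]  # 1, 1/2, 1/4, ...
    return sorted([-m for m in mags] + mags)


def quantize(w, levels):
    """Snap w to the nearest representable level (linear-domain rounding)."""
    return min(levels, key=lambda q: abs(q - w))


def apply_weight(x, k, negative=False):
    """Multiply integer activation x by +/-2**-k with one right shift, no multiplier."""
    y = x >> k
    return -y if negative else y


levels3 = pot_levels(3)
print(levels3)                 # [-1.0, -0.5, -0.25, -0.125, 0.125, 0.25, 0.5, 1.0]
print(quantize(0.7, levels3))  # 0.5
print(quantize(0.8, levels3))  # 1.0
print(apply_weight(96, 2))     # 24, i.e. 96 * 0.25
```

The levels crowd near zero and thin out toward 1, so 0.7 and 0.8 snap to values 0.5 apart. That uneven spacing is the non-uniform lattice described above, and at 3 bits there are only eight levels to work with.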

ORP targets the geometry. Rather than tweaking the quantization grid, the method treats quantization as a dual-basis projection: a primary PoT basis handles the coarse approximation, and an analytically derived residual lattice, built from shift-and-add operations alone, fills the angular gaps. No multipliers enter the residual path. Because the residual basis is derived analytically rather than learned, full-model calibration takes approximately 15 minutes, which the authors position as a practical alternative to computationally intensive gradient-based optimization.
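The intuition behind a residual term can be sketched in a few lines of Python. This is a greedy two-term approximation chosen for illustration only; it is not the paper's analytical solver, and the exponent range, the Gaussian test weights, and the helper names `nearest_pot`, `two_term`, and `cosine` are all assumptions.

```python
import math
import random


def nearest_pot(w, min_exp=-6, max_exp=0):
    """Nearest signed power of two (rounded in the log domain); 0.0 for w == 0."""
    if w == 0:
        return 0.0
    e = max(min_exp, min(max_exp, round(math.log2(abs(w)))))
    return math.copysign(2.0 ** e, w)


def two_term(w):
    """Greedy sketch: a primary PoT term plus a PoT-quantized residual."""
    primary = nearest_pot(w)
    resid = w - primary
    correction = nearest_pot(resid)
    # Drop the correction if the clamped exponent range would increase the error.
    if abs(resid - correction) >= abs(resid):
        correction = 0.0
    return primary + correction


def cosine(a, b):
    """Cosine similarity, used here only to measure angular error."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.25) for _ in range(4096)]  # synthetic stand-in weights
one = [nearest_pot(w) for w in weights]
two = [two_term(w) for w in weights]
cos_one = cosine(weights, one)
cos_two = cosine(weights, two)
print(f"one term: {cos_one:.4f}  two terms: {cos_two:.4f}")
```

A cosine closer to 1 means the quantized vector points in nearly the same direction as the original. Each weight in the two-term version is still a sum of two signed powers of two, so applying it to an activation costs two shifts and one add.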

At W3/A16 (3-bit weights, 16-bit activations), ORP achieves a perplexity of 6.10 on LLaMA-2-7B. The paper describes this as comparing favourably to conventional MAC-intensive baselines such as AWQ, and ORP gets there without multipliers or asymmetric scaling. At 4-bit, ORP remains competitive. Vision Transformer results are also reported, but LLaMA perplexity is the headline metric.

FIG. 02 ORP vs AWQ: perplexity comparison at W3/A16 quantization on LLaMA-2-7B. — ArXiv 2605.26092
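For readers less familiar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch, where `perplexity` is a hypothetical helper and the log-probabilities are made up for illustration:

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# A model that assigns probability 1/4 to every correct token scores about 4.
uniform4 = [math.log(0.25)] * 10
print(perplexity(uniform4))
```

Read the same way, a perplexity of 6.10 corresponds to the uncertainty of a uniform choice among roughly six tokens at each step.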

Standard-cell RTL synthesis at 28nm indicates that ORP's shift-and-add datapath shortens the critical timing path relative to dense multiplier trees, directly addressing the ASIC timing-closure problem that hinders sub-4-bit LLM inference on custom silicon. FPGAs gain similarly: shift operations consume no DSP blocks, whereas multipliers typically do.
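In software terms, a shift-and-add datapath replaces every multiply in a dot product with shifts and signed accumulation. The Python sketch below models that behaviour on integer (fixed-point) activations. The per-weight list of (sign, exponent) terms is an assumed encoding for illustration, and the code makes no claim about the paper's RTL.

```python
def shift(x, e):
    """x * 2**e on integers: left shift for e >= 0, arithmetic right shift otherwise."""
    return x << e if e >= 0 else x >> -e


def shift_add_dot(acts, weights):
    """Dot product using only shifts, adds, and subtracts.

    Each weight is a list of (sign, exponent) terms, so 0.75 = 2**0 - 2**-2
    is written [(1, 0), (-1, -2)].
    """
    acc = 0
    for x, terms in zip(acts, weights):
        for sign, e in terms:
            if sign > 0:
                acc += shift(x, e)
            elif sign < 0:
                acc -= shift(x, e)
    return acc


acts = [64, -32, 128]            # integer activations
weights = [
    [(1, 0), (-1, -2)],          # 0.75
    [(-1, -1)],                  # -0.5
    [(1, -1), (-1, -3)],         # 0.375
]
print(shift_add_dot(acts, weights))  # 112 == 64*0.75 + (-32)*(-0.5) + 128*0.375
```

The result is exact here because every activation is divisible by the shifted amounts. In general a right shift discards low bits, so a fixed-point design would need to budget fractional bits or add rounding.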

Limitations: ORP keeps activations at 16 bits in W3/A16. Full ultra-low-bit inference (W3/A3 or W4/A4) is not demonstrated. Calibration takes approximately 15 minutes and requires a calibration dataset; zero-shot deployment on resource-constrained nodes without calibration is not addressed. The 28nm synthesis is a standard-cell estimate, not tape-out silicon, so real-world timing margins may vary.

For teams building on-device inference on custom ASICs, FPGAs, or microcontrollers without DSP blocks, ORP offers a path to low-bit weight quantization without adding multiplier hardware. The roughly 15-minute calibration and analytical solver reduce friction for model swaps. At W3/A16, the reported perplexity may be adequate for many edge NLP tasks, though that depends on the application. Verify activation requirements against your pipeline before committing: ORP's sweet spot is low-bit weights with 16-bit activations.

Sources

ORP achieves a perplexity of 6.10 on LLaMA-2-7B at W3/A16, without asymmetric scaling, comparing favourably to MAC-intensive baseline AWQ
"Under the 3-bit (W3/A16) constraint, ORP achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to conventional MAC-intensive baselines like AWQ without relying on asymmetric scaling"
arxiv.org ↗
ORP's analytical solver reduces full-model calibration time for LLaMA-2-7B to approximately 15 minutes
"ORP's analytical solver offers a practical alternative to computationally intensive gradient-based optimization, reducing the full-model calibration time for LLaMA-2-7B to approximately 15 minutes"
arxiv.org ↗
Standard-cell RTL synthesis at a 28nm node shows ORP mitigates timing bottlenecks associated with dense multiplier trees
"standard-cell RTL synthesis at a 28nm node indicates that ORP effectively mitigates the timing bottlenecks associated with dense multiplier trees"
arxiv.org ↗
PoT quantization below 4-bit suffers a Low Angular Resolution Regime — a structural flaw causing degradation of high-dimensional feature manifolds
"the non-uniform exponential lattice is inherently limited by a Low Angular Resolution Regime, a structural flaw that becomes particularly pronounced at sub-4-bit thresholds, leading to a notable degradation of high-dimensional feature manifolds"
arxiv.org ↗
ORP synthesises a higher-resolution residual lattice using strictly shift-and-add operations, replacing multiply-accumulate hardware
"ORP adaptively synthesizes a higher-resolution residual lattice using strictly shift-and-add operations"
arxiv.org ↗

Written and edited by AI agents · Methodology
