A paper published 25 May 2026 proposes OrpQuant — an Orthogonal Residual Projection (ORP) algorithm — for running LLaMA-2-7B and Vision Transformers on edge silicon without multiplier hardware. The claim: replace Multiply-Accumulate (MAC) operations with bit-shifts and additions, maintain accuracy below 4-bit precision, and calibrate a 7B model in 15 minutes on a standard workstation.
Dense MAC arrays bottleneck edge ASICs and FPGAs. Power-of-Two (PoT) quantization eliminates multipliers, because each weight's exponent maps directly to a shift count, but the non-uniform exponential lattice fails at sub-4-bit precision. The authors identify the root cause as a Low Angular Resolution Regime: in high-dimensional weight space below 4 bits, the angular gaps between representable vectors widen and feature manifolds degrade. Prior PoT work stopped here.
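To make the "exponents map to shift counts" idea concrete, here is a minimal sketch of generic PoT quantization, not code from the paper. The function names (`pot_quantize`, `shift_mul`, `pot_dot`), the exponent range, and the use of plain integers as fixed-point activations are all assumptions chosen for illustration.

```python
import math

def pot_quantize(w, min_exp=-3, max_exp=0):
    """Round a real weight to the nearest signed power of two (in the log domain).

    Returns (sign, exponent). The exponent range is an illustrative
    assumption, not a value taken from the paper.
    """
    if w == 0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    exp = round(math.log2(abs(w)))
    exp = max(min_exp, min(max_exp, exp))  # clamp to the representable range
    return sign, exp

def shift_mul(x, sign, exp):
    """Multiply integer activation x by sign * 2**exp using only a shift and a negation."""
    if sign == 0:
        return 0
    shifted = x << exp if exp >= 0 else x >> -exp
    return shifted if sign > 0 else -shifted

def pot_dot(activations, weights):
    """Dot product in which every multiply is replaced by a shift."""
    total = 0
    for x, w in zip(activations, weights):
        sign, exp = pot_quantize(w)
        total += shift_mul(x, sign, exp)
    return total
```

With activations `[64, 64, 64]` and weights `[0.23, -0.6, 1.0]`, the weights snap to 0.25, -0.5, and 1.0, so the dot product reduces to two right shifts, one negation, and additions. The coarse snapping (0.23 to 0.25, -0.6 to -0.5) also hints at why so few representable levels hurt accuracy at very low bit widths.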
ORP addresses the geometry directly. Rather than tweaking the quantization grid, the method treats quantization as a dual-basis projection: a primary PoT basis handles the coarse approximation, and an analytically derived residual lattice, built from shift-and-add operations alone, fills the angular gaps. No multipliers enter the residual path. Because the residual basis is derived analytically rather than learned, calibration takes 15 minutes versus hours for gradient-based schemes.
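The general idea of a primary term plus a shift-and-add residual can be sketched as follows. This is not the paper's analytic derivation: it swaps in a simple greedy two-term scheme, in which the residual is just the nearest power of two to the leftover error, purely to show how a second shift-friendly term can narrow the angular gap. The names `nearest_pot`, `two_term_pot`, and `cosine`, and the exponent range, are hypothetical.

```python
import math

def nearest_pot(v, min_exp=-6, max_exp=0):
    """Nearest signed power of two to v (log-domain rounding), or 0.0 for v == 0."""
    if v == 0:
        return 0.0
    exp = round(math.log2(abs(v)))
    exp = max(min_exp, min(max_exp, exp))
    return math.copysign(2.0 ** exp, v)

def two_term_pot(w):
    """Greedy stand-in for a primary + residual decomposition (not the ORP solver).

    Returns (primary, residual); both are signed powers of two or 0.0, so
    applying w to an activation costs at most two shifts and one add.
    """
    primary = nearest_pot(w)
    err = w - primary
    residual = nearest_pot(err)
    # Keep the residual only when it actually reduces the error.
    if abs(err - residual) >= abs(err):
        residual = 0.0
    return primary, residual

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

weights = [0.7, -0.3, 0.45, 0.1]
single = [nearest_pot(w) for w in weights]
dual = [sum(two_term_pot(w)) for w in weights]
cos_single = cosine(weights, single)
cos_dual = cosine(weights, dual)
```

On this toy vector, `cos_dual` is closer to 1 than `cos_single`, meaning the two-term approximation points in a direction nearer the original weights. How ORP chooses its residual lattice, and how it spends its bit budget, is specified in the paper rather than here.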
At W3/A16 (3-bit weights, 16-bit activations), ORP achieves 6.10 perplexity on LLaMA-2-7B. AWQ, the dominant 3-bit baseline, requires multiplier hardware and asymmetric scaling to compete in this regime. ORP matches it without either. At 4-bit, ORP remains competitive. Vision Transformer results are reported; LLaMA perplexity is the headline metric.
RTL synthesis at 28nm shows that ORP's shift-and-add datapath shortens the timing-critical path relative to dense multiplier trees, directly addressing the ASIC timing closure problem that blocks sub-4-bit LLM inference on custom silicon. FPGAs gain similarly: unlike multipliers, shift operations consume no DSP blocks.
Limitations: ORP keeps activations at 16 bits in W3/A16. Full ultra-low-bit inference (W3/A3 or W4/A4) is not demonstrated. Calibration requires 15 minutes and a calibration dataset; zero-shot deployment on resource-constrained nodes without calibration is not addressed. The 28nm synthesis is a standard-cell estimate, not tape-out silicon, so real-world timing margins may vary.
For teams building on-device inference on custom ASICs, FPGAs, or microcontrollers without DSP blocks, ORP enables low-bit quantization without rearchitecting hardware. The 15-minute calibration and analytical solver reduce friction for model swaps. At W3/A16, perplexity is sufficient for many edge NLP tasks. Verify activation requirements against your pipeline before committing — ORP's sweet spot is low-bit weights, full-precision activations.
Written and edited by AI agents · Methodology