RESEARCH · BY AI | EXPERT SCOUT · Thursday, May 28, 2026 · 4 MIN READ
Omega-QVLA Cuts Robot Vision Model Memory by 71% Without Retraining
New quantization technique achieves end-to-end compression of VLA models. The goal is real-time on-device control for embodied agents, without cloud calls or mixed-precision overhead.
Omega-QVLA introduces the first training-free framework to quantize both the LLM backbone and full diffusion action head of a vision-language-action model to uniform W4A4, achieving a 71.3% reduction in static memory and surpassing FP16 baseline task success for Pi 0.5 and GR00T N1.5.
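As a back-of-envelope check on the 71.3% figure, moving from 16-bit to 4-bit storage would remove 75% of the bytes if every parameter were quantized and no metadata were stored. The sketch below compares that ideal with the reported number. The two-bucket split (4-bit versus untouched FP16) is an assumption made purely for illustration, not a breakdown reported by the paper, and it ignores scale tables and other overhead.

```python
FP16_BITS = 16
W4_BITS = 4

# Ideal saving if every stored value shrinks from 16 bits to 4 bits.
ideal_reduction = 1 - W4_BITS / FP16_BITS  # 0.75

# Static-memory reduction quoted in the article.
reported_reduction = 0.713

# Gap between the ideal and the reported figure, in percentage points.
gap_points = (ideal_reduction - reported_reduction) * 100

# Hypothetical two-bucket model: a fraction f of the FP16 footprint is packed
# to 4 bits and the rest stays in FP16, so reduction = f * ideal_reduction.
quantized_fraction = reported_reduction / ideal_reduction

print(f"ideal reduction:    {ideal_reduction:.1%}")
print(f"reported reduction: {reported_reduction:.1%}")
print(f"gap:                {gap_points:.1f} points")
print(f"implied 4-bit share of footprint: {quantized_fraction:.1%}")
```

Under that simplified model, roughly 95% of the original footprint would be stored at 4 bits, which is consistent with a scheme that quantizes both the backbone and the action head rather than only one of them.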
Contrary to the previous consensus that a DiT action head is unstable under uniform quantization, Omega-QVLA employs a composite SVD-Hadamard rotation to balance per-channel weight energy and disperse residual activation outliers. It also uses a per-step activation scaling table to manage dynamic-range drift across DiT denoising steps, resulting in a uniform bit width end-to-end without retraining.
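To illustrate why a rotation helps with outliers, here is a minimal sketch of the Hadamard half of the idea only. The SVD component and the paper's actual composite construction are not reproduced; the function names, the toy activation vector, and the symmetric 4-bit grid of [-7, 7] are all assumptions of this example. It shows that an orthonormal Hadamard rotation preserves the vector's norm while spreading a single large channel across every coordinate, which lowers the peak that sets the quantization step.

```python
import math

def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def peak(x):
    return max(abs(v) for v in x)

def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))

def int4_step(x):
    """Step size of a symmetric 4-bit grid whose top level sits at the peak."""
    return peak(x) / 7.0

# Toy activation vector: seven small channels plus one large outlier channel.
acts = [0.1, -0.2, 0.15, 0.05, -0.1, 0.2, -0.05, 8.0]
rotated = hadamard_rotate(acts)

print(f"norm  before/after: {l2_norm(acts):.4f} / {l2_norm(rotated):.4f}")
print(f"peak  before/after: {peak(acts):.4f} / {peak(rotated):.4f}")
print(f"step  before/after: {int4_step(acts):.4f} / {int4_step(rotated):.4f}")
```

A lower peak gives a finer 4-bit grid for the same tensor. Whether the end-to-end quantization error actually falls depends on the data and on how the matching rotation is folded into the weights, which this sketch does not attempt to show.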
The implementation varies by model family. For GR00T N1.5, the Eagle LLM backbone runs DuQuant A2-lite with RTN at runtime, while the DiT action head is packed offline using rotation plus RTN and a per-step act_scale_table. For Pi 0.5, the PaliGemma backbone also uses runtime A2-lite, but the action head switches to GPTQ for offline packing. Calibration requires only 10 samples, 8 denoising steps, and a 1024-token cap. The build hardware is not lean, however: the minimum is a single NVIDIA A100 40 GB, and the long LIBERO suites need 8 GPUs and approximately three hours, compared with 30 minutes for the standard suites.
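One way to picture a per-step activation scale table is sketched below, assuming the simplest possible rule: for each denoising step, take the largest activation magnitude seen across the calibration samples and map it to the top of a symmetric 4-bit range. The sample and step counts come from the article; `fake_activations`, the drift pattern, and the max-based rule are hypothetical stand-ins, and the repository's real calibration procedure may differ.

```python
import random

NUM_SAMPLES = 10  # calibration samples, per the article
NUM_STEPS = 8     # denoising steps, per the article
INT4_MAX = 7      # symmetric signed 4-bit range, an assumption of this sketch

def fake_activations(sample, step, width=16):
    """Hypothetical stand-in for one DiT layer's activations at one step."""
    rng = random.Random(sample * 1000 + step)
    spread = 4.0 / (step + 1)  # toy drift: dynamic range shrinks step by step
    return [rng.gauss(0.0, spread) for _ in range(width)]

def build_act_scale_table(num_samples=NUM_SAMPLES, num_steps=NUM_STEPS):
    """Return one scale per denoising step from the calibration peak."""
    table = []
    for step in range(num_steps):
        step_peak = max(
            abs(value)
            for sample in range(num_samples)
            for value in fake_activations(sample, step)
        )
        table.append(step_peak / INT4_MAX)
    return table

act_scale_table = build_act_scale_table()
for step, scale in enumerate(act_scale_table):
    print(f"step {step}: scale {scale:.4f}")
```

A single shared scale would have to cover the widest step and would waste resolution on the narrower ones, which is the drift problem a per-step table is meant to address.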
On LIBERO, the quantized Pi 0.5 achieves 98.0% task success against a 97.1% FP16 reference; GR00T N1.5 scores 87.8% against 87.0% FP16. The closest competing uniform-quantization entry, QuantVLA, reaches only 95.3% on Pi 0.5 at W4A4, 2.7 percentage points behind Omega-QVLA, and is effectively tied on GR00T N1.5 at roughly 88%. The paper does not provide edge-inference metrics such as on-device latency, control-loop Hz, or power draw. While the 71.3% memory reduction suggests that real-time on-device control is feasible, architects must still bake the quantized tables on data-center A100s before porting to an edge SoC.
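The score comparisons above reduce to simple arithmetic, reproduced here as a small sketch using only the figures quoted in this article. The dictionary layout and function name are illustrative choices, and QuantVLA's GR00T N1.5 score is left out because the article gives it only approximately.

```python
# LIBERO task success (%), as quoted in the article.
results = {
    "Pi 0.5": {"fp16": 97.1, "omega_qvla": 98.0},
    "GR00T N1.5": {"fp16": 87.0, "omega_qvla": 87.8},
}
QUANTVLA_PI05 = 95.3  # QuantVLA on Pi 0.5 at W4A4

def summarize(fp16, quantized):
    """Return (gain in percentage points, quantized score as % of FP16)."""
    gain_points = round(quantized - fp16, 1)
    retained_pct = round(100.0 * quantized / fp16, 1)
    return gain_points, retained_pct

summary = {
    name: summarize(row["fp16"], row["omega_qvla"])
    for name, row in results.items()
}
lead_over_quantvla = round(results["Pi 0.5"]["omega_qvla"] - QUANTVLA_PI05, 1)

for name, (gain, retained) in summary.items():
    print(f"{name}: {gain:+.1f} points vs FP16, {retained:.1f}% of FP16")
print(f"Pi 0.5 lead over QuantVLA: {lead_over_quantvla:.1f} points")
```

Both gains are under one percentage point, so they are best read as parity with FP16 rather than a clear improvement unless the paper reports run-to-run variance.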
FIG. 02 Task success rates (%) on LIBERO: Omega-QVLA quantized vs. FP16 baseline and QuantVLA competitor. Both models maintain ≥97% of original performance post-quantization. Source: Omega-QVLA paper (arXiv:2605.28803), QuantVLA paper (arXiv:2602.20309)
Integration friction is a hidden cost. The repository maintains separate conda environments (`custon_asr` for GR00T and `openpi` for Pi 0.5), and the PTQ recipe switches the action head's offline packing between RTN and GPTQ depending on the model family. The per-step scaling table introduces runtime coupling to the DiT denoising schedule; a step-count mismatch or table truncation becomes a new failure mode. The evaluation is weighted toward LIBERO simulation, with limited real-world manipulation reported, leaving open questions about behavior under out-of-distribution visuals or long-horizon drift.
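The step-count mismatch and truncation failure modes can be caught at load time rather than mid-rollout. The guard below is a hypothetical sketch, not code from the repository: `ScaleTableError`, `validate_scale_table`, and `scale_for_step` are invented names, and the table values are made up for the example.

```python
import math

class ScaleTableError(ValueError):
    """Raised when a per-step scale table does not fit the denoising schedule."""

def validate_scale_table(table, num_denoising_steps):
    """Fail fast if the table length or contents cannot serve the schedule."""
    if len(table) != num_denoising_steps:
        raise ScaleTableError(
            f"table has {len(table)} entries, "
            f"schedule has {num_denoising_steps} steps"
        )
    for step, scale in enumerate(table):
        if not (math.isfinite(scale) and scale > 0.0):
            raise ScaleTableError(f"invalid scale {scale!r} at step {step}")
    return table

def scale_for_step(table, step):
    """Look up a scale, rejecting indices Python would otherwise wrap around."""
    if not 0 <= step < len(table):
        raise ScaleTableError(f"step {step} outside table of {len(table)}")
    return table[step]

# Made-up table for an 8-step schedule.
table = validate_scale_table([0.9, 0.7, 0.6, 0.5, 0.45, 0.4, 0.38, 0.35], 8)

# A truncated copy is rejected before any inference runs.
try:
    validate_scale_table(table[:6], 8)
    truncated_rejected = False
except ScaleTableError:
    truncated_rejected = True

print("truncated table rejected:", truncated_rejected)
```

The explicit range check matters because a negative or stale step index would otherwise return a valid-looking scale from the wrong end of the list.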