Omega-QVLA Cuts Robot Vision Model Memory by 71% Without Retraining

Omega-QVLA is presented by its authors as the first training-free framework to quantize both the LLM backbone and the full diffusion action head of a vision-language-action (VLA) model to uniform W4A4 (4-bit weights, 4-bit activations). It cuts static memory by 71.3% while matching or exceeding FP16 task success on Pi 0.5 and GR00T N1.5.

Contrary to the previous consensus that a DiT action head is unstable under uniform quantization, Omega-QVLA employs a composite SVD-Hadamard rotation to balance per-channel weight energy and disperse residual activation outliers. It also uses a per-step activation scaling table to manage dynamic-range drift across DiT denoising steps, resulting in a uniform bit width end-to-end without retraining.
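To make the outlier-dispersal idea concrete, here is a minimal Python sketch; it is not the paper's implementation. It covers only the Hadamard half of the composite rotation (the SVD component is omitted), applies symmetric round-to-nearest 4-bit quantization to a toy activation vector with one outlier channel, and compares reconstruction error with and without the rotation. All function names and the example vector are illustrative assumptions.

```python
import math


def hadamard_rotate(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [v * norm for v in y]


def rtn_int4(x):
    """Symmetric round-to-nearest to integers in [-8, 7], one scale per vector."""
    amax = max(abs(v) for v in x)
    scale = amax / 7.0 if amax > 0 else 1.0
    q = [max(-8, min(7, round(v / scale))) for v in x]
    return q, scale


def dequant(q, scale):
    return [v * scale for v in q]


def sq_error(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


# Toy activation vector: seven small channels and one outlier (illustrative).
x = [0.5, -0.4, 0.3, -0.2, 0.1, -0.3, 0.2, 8.0]

# Direct 4-bit quantization: the outlier sets the scale for every channel.
q_direct, s_direct = rtn_int4(x)
err_direct = sq_error(x, dequant(q_direct, s_direct))

# Rotate, quantize, dequantize, rotate back.
x_rot = hadamard_rotate(x)
q_rot, s_rot = rtn_int4(x_rot)
x_back = hadamard_rotate(dequant(q_rot, s_rot))
err_rotated = sq_error(x, x_back)

print(f"direct error:  {err_direct:.4f}")
print(f"rotated error: {err_rotated:.4f}")
```

On this toy vector the rotated path gives the smaller squared error, because the single large value no longer sets a scale that rounds every other channel to zero. The normalized Hadamard matrix used here is its own inverse, so applying the same transform a second time undoes the rotation.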

The implementation varies by model family. For GR00T N1.5, the Eagle LLM backbone runs DuQuant A2-lite with RTN at runtime, while the DiT action head is packed offline using rotation plus RTN and a per-step `act_scale_table`. Pi 0.5's PaliGemma backbone also uses runtime A2-lite, but its action head switches to GPTQ for offline packing. Calibration for the GR00T N1.5 DiT pack uses only 10 samples, 8 denoising steps, and a 1024-token cap. The build hardware is not lean, however: the minimum is a single NVIDIA A100 40 GB, with 4 to 8 GPUs recommended for parallel multi-suite evaluation. The long LIBERO suite is sharded across 8 GPUs and takes about three hours, versus about 30 minutes for each of the other three suites.
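A per-step activation scale table can be pictured as one quantization scale per denoising step, measured offline on calibration data. The sketch below is an assumption about the general shape of such a table, not the repository's `act_scale_table` format. It uses synthetic activations with 10 samples and 8 steps to mirror the calibration settings quoted above, and the helper names and data layout are hypothetical.

```python
def build_act_scale_table(calib_acts, qmax=7):
    """Return one symmetric int4 scale per denoising step.

    calib_acts[s][t] holds the activation values seen for calibration
    sample s at denoising step t (hypothetical layout).
    """
    num_steps = len(calib_acts[0])
    table = []
    for t in range(num_steps):
        amax = max(abs(v) for sample in calib_acts for v in sample[t])
        table.append(amax / qmax if amax > 0 else 1.0)
    return table


def quantize_act(values, step, table, qmin=-8, qmax=7):
    """Quantize activations with the scale recorded for this denoising step."""
    scale = table[step]
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


# Synthetic calibration data: 10 samples, 8 steps, with a dynamic range
# that shrinks as denoising proceeds (an assumption made for illustration).
NUM_SAMPLES, NUM_STEPS = 10, 8
calib_acts = [
    [
        [(-1) ** k * (NUM_STEPS - t) * (1 + 0.05 * s) * (k + 1) / 4 for k in range(4)]
        for t in range(NUM_STEPS)
    ]
    for s in range(NUM_SAMPLES)
]

table = build_act_scale_table(calib_acts)
q_first = quantize_act(calib_acts[0][0], 0, table)
q_last = quantize_act(calib_acts[0][NUM_STEPS - 1], NUM_STEPS - 1, table)

print("scales per step:", [round(s, 3) for s in table])
print("first step codes:", q_first)
print("last step codes: ", q_last)
```

With a single scale shared across all steps, the late-step activations in this toy data would occupy only a small part of the 4-bit range. A per-step scale keeps each step's values spread across the available levels.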

On LIBERO, the quantized Pi 0.5 achieves 98.0% task success against a 97.1% FP16 reference, and GR00T N1.5 scores 87.8% against 87.0% FP16. The closest competing uniform-quantization entry, QuantVLA, reaches 95.3% on Pi 0.5 at W4A4, 2.7 percentage points behind Omega-QVLA, and is effectively tied on GR00T N1.5 at 88.0% versus 87.8%. The paper does not report edge-inference metrics such as on-device latency, control-loop rate in Hz, or power draw. The 71.3% memory reduction suggests that real-time on-device control may be feasible, but architects must still build the quantized packs on data-center A100s before porting to an edge SoC.

FIG. 02 Task success rates (%) on LIBERO: Omega-QVLA quantized vs. FP16 baseline and QuantVLA competitor. Both models maintain ≥97% of original performance post-quantization. — Omega-QVLA paper (arXiv:2605.28803), QuantVLA paper (arXiv:2602.20309)
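The deltas behind this comparison can be reproduced with simple arithmetic. The snippet below uses only the success rates quoted in this article; the dictionary layout and key names are illustrative.

```python
# Task success rates (%) on LIBERO as quoted in the article.
results = {
    "Pi 0.5": {"fp16": 97.1, "omega_qvla": 98.0, "quantvla": 95.3},
    "GR00T N1.5": {"fp16": 87.0, "omega_qvla": 87.8, "quantvla": 88.0},
}

summary = {}
for model, r in results.items():
    summary[model] = {
        # Percentage-point change of the quantized model over its FP16 reference.
        "gain_vs_fp16_pp": round(r["omega_qvla"] - r["fp16"], 1),
        # Percentage-point gap to QuantVLA (negative means QuantVLA is ahead).
        "gap_vs_quantvla_pp": round(r["omega_qvla"] - r["quantvla"], 1),
        # Quantized success as a percentage of the FP16 success rate.
        "retention_pct": round(100 * r["omega_qvla"] / r["fp16"], 1),
    }

for model, row in summary.items():
    print(model, row)
```

On GR00T N1.5 the gap against QuantVLA comes out slightly negative, which is why the comparison there is best read as a tie rather than a win.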

Integration friction is a hidden cost. The repository maintains separate conda environments (`custon_asr` for GR00T and `openpi` for Pi 0.5), and the PTQ recipe switches the weight quantizer between RTN and GPTQ depending on the action head. The per-step scaling table couples the runtime to the DiT denoising schedule, so a step-count mismatch or a truncated table becomes a new failure mode. The evaluation is weighted toward LIBERO simulation, with limited real-world manipulation reported, which leaves open questions about behavior under out-of-distribution visuals or long-horizon drift.
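One way to guard against the step-count mismatch described above is to validate the table against the sampler configuration before inference. This is a hypothetical check, not code from the Omega-QVLA repository; the function name and the list-of-floats table layout are assumptions.

```python
import math


def validate_scale_table(table, num_inference_steps):
    """Fail fast if the per-step table does not match the denoising schedule."""
    if len(table) != num_inference_steps:
        raise ValueError(
            f"scale table has {len(table)} entries but the sampler runs "
            f"{num_inference_steps} denoising steps"
        )
    for step, scale in enumerate(table):
        if not (math.isfinite(scale) and scale > 0):
            raise ValueError(f"invalid scale {scale!r} at step {step}")
    return table


# A table calibrated for 8 steps passes for an 8-step sampler.
validated = validate_scale_table([0.5] * 8, num_inference_steps=8)

# The same table is rejected for a 10-step sampler instead of being
# indexed out of range or silently reused with the wrong scales.
try:
    validate_scale_table([0.5] * 8, num_inference_steps=10)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

Running a check like this at load time turns a silent accuracy regression into an explicit configuration error.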

Sources

Omega-QVLA achieves 98.0% task success on Pi 0.5 at W4A4 vs 97.1% FP16 reference, and 87.8% on GR00T N1.5 vs 87.0% FP16, while reducing static memory by 71.3%
"Omega-QVLA compresses Pi 0.5 and GR00T N1.5 to W4A4 with 98.0% and 87.8% task success rates, matching or exceeding their FP16 references of 97.1% and 87.0%, while reducing the static memory footprint by 71.3%."
arxiv.org ↗
Omega-QVLA is the first training-free PTQ framework to compress both the LLM backbone and full DiT action head to uniform W4A4, eliminating mixed-precision allocation
"the first training-free post-training quantization framework that compresses both the language backbone and the entire diffusion action head of a VLA model to a uniform W4A4 precision, eliminating the need for mixed-precision allocation"
arxiv.org ↗
Omega-QVLA uses composite SVD-Hadamard rotation and per-step DiT activation scaling to stabilize uniform W4A4 quantization
"Omega-QVLA combines a composite SVD-Hadamard rotation that equalizes per-channel weight energy while diffusing residual activation outliers with per-step DiT activation scaling quantization that absorbs dynamic-range drift across denoising steps."
arxiv.org ↗
Minimum hardware for building Omega-QVLA quantized packs is 1× NVIDIA A100 40 GB; long LIBERO suites require 8 GPUs and ~3 hours
"1× NVIDIA A100 (40GB) minimum for build; 4–8× recommended for parallel multi-suite eval... For long use GPU_LIST=0,1,2,3,4,5,6,7 (8 shards) — ~3 h/suite vs ~30 min for the other three."
github.com ↗
GR00T N1.5 DiT pack calibration uses 10 samples, 8 denoising steps, and a 1024-token cap with the A2-lite rotation + RTN + per-step act_scale_table recipe
"--num-samples 10 --token-cap 1024 --num-steps 8 \ --svd-rank 0 --use-rtn"
github.com ↗
QVLA (prior art, ICLR '26) leaves the projector and action head at full BF16 precision to preserve control stability
"The projector and action head remain in full BF16 precision to preserve control stability."
arxiv.org ↗
Competing QuantVLA achieves only 95.3% on Pi 0.5 at W4A4, 2.7 percentage points below Omega-QVLA's 98.0%
"achieving 95.3% average success rate at W4A4, which demonstrates stable behavior under aggressive quantization"
arxiv.org ↗
QuantVLA reaches 88.0% on GR00T N1.5 at W4A4 with 8 denoising steps, effectively tied with Omega-QVLA's 87.8%
"QuantVLA consistently matches or exceeds the baseline, reaching 88.0% average success at 8 steps"
arxiv.org ↗

Written and edited by AI agents · Methodology
