PAW Trades Compile Time for 1/50th the Inference Memory

Researchers at the University of Waterloo, Cornell, and Harvard published Program-as-Weights (PAW) on July 2, 2026. The system compiles natural-language function descriptions into 23 MB LoRA adapter files and runs them locally on a frozen 600M-parameter model with no API dependency. In the paper's results, a 0.6B Qwen3 interpreter loaded with a PAW adapter scored 73.78% exact match on FuzzyBench against 68.70% for direct prompting of Qwen3-32B, at roughly 1/50th the inference memory. The quantized system runs at 30 tokens per second on a MacBook M3.

The architecture splits into two phases: compile time and run time. At compile time, an off-the-shelf 4B Qwen3 pseudo-compiler, used without fine-tuning, rewrites the developer's natural-language spec into a cleaned pseudo-program (a paraphrased description plus input/output examples). A second model, a 4B LoRA compiler trained on FuzzyBench, reads the spec and that pseudo-program and emits LoRA weights for the frozen interpreter. The large models touch the problem once. Every subsequent call uses only the 0.6B interpreter plus the 23 MB adapter.
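The data flow of the two phases can be sketched in a few lines of Python. This is a schematic under stated assumptions, not the paper's implementation: the type names (`PseudoProgram`, `Adapter`), the function names (`compile_spec`, `run`), and the stage signatures are all hypothetical, and the three models are passed in as plain callables so the sketch runs without them. Following the paper's description quoted under Sources, the LoRA compiler receives both the original spec and the pseudo-program.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PseudoProgram:
    spec: str  # the developer's original natural-language spec
    text: str  # cleaned rewrite produced by the pseudo-compiler


@dataclass(frozen=True)
class Adapter:
    weights: bytes  # stands in for the per-function LoRA file


# Hypothetical stage signatures; the real models are not called here.
PseudoCompiler = Callable[[str], str]         # spec -> pseudo-program text
LoRACompiler = Callable[[str, str], bytes]    # (spec, pseudo-program) -> weights
Interpreter = Callable[[Adapter, str], str]   # (adapter, input) -> output


def compile_spec(spec: str,
                 pseudo_compiler: PseudoCompiler,
                 lora_compiler: LoRACompiler) -> Adapter:
    """Compile phase: both large models run exactly once per function."""
    pseudo = PseudoProgram(spec=spec, text=pseudo_compiler(spec))
    return Adapter(weights=lora_compiler(pseudo.spec, pseudo.text))


def run(adapter: Adapter, interpreter: Interpreter, user_input: str) -> str:
    """Run phase: only the small interpreter and the adapter are involved."""
    return interpreter(adapter, user_input)
```

The point of the split is visible in the signatures: `run` never receives either compiler, so nothing from the compile phase has to be resident at inference time.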

The on-disk footprint is a roughly 430 MB GGUF base, shared across all functions, plus one 23 MB LoRA per function. Teams running multiple fuzzy functions (log triage, JSON repair, intent routing) amortize the base cost across their toolset. A smaller GPT-2 path runs entirely client-side in the browser via WebAssembly, with no local binary.
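The amortization is simple arithmetic, sketched below using the two sizes reported in the article. The function names are illustrative, and the sketch assumes the base is stored exactly once and every adapter is the same 23 MB.

```python
BASE_MB = 430     # shared GGUF base, as reported in the article
ADAPTER_MB = 23   # one LoRA adapter per compiled function


def total_footprint_mb(num_functions: int) -> int:
    """Disk used by one shared base plus one adapter per function."""
    if num_functions < 0:
        raise ValueError("num_functions must be non-negative")
    return BASE_MB + ADAPTER_MB * num_functions


def per_function_mb(num_functions: int) -> float:
    """Average disk cost per function once the base is shared."""
    if num_functions < 1:
        raise ValueError("need at least one function")
    return total_footprint_mb(num_functions) / num_functions


for n in (1, 3, 10):
    print(n, total_footprint_mb(n), round(per_function_mb(n), 1))
```

Under these assumptions, one function costs 453 MB, three cost 499 MB in total, and ten cost 660 MB, or 66 MB per function on average.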

FIG. 02 PAW's total footprint (430 MB shared base + 23 MB per-function LoRA), which the paper reports as roughly 1/50th the inference memory of direct prompting with Qwen3-32B. Source: PAW paper, https://arxiv.org/html/2607.02512

FuzzyBench, released with the paper, contains 10 million examples built across 29 thematic versions and covering more than 800 fuzzy task categories: classification, format conversion, parsing, fuzzy matching, natural-language commands, agentic tool use, and more. The researchers demonstrated five use cases: event-driven log monitoring, intent-based site navigation, semantic search reranking, a tool-calling pipeline scoring 93% on a standard agentic evaluation, and a multilingual word-guessing game. A Python SDK ships with the paper: `paw.compile_and_load("Classify if a message needs immediate attention or can wait")` returns a callable that runs locally, with no API calls after the one compile step.
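In application code, the compile-once pattern usually means holding on to the returned callable rather than recompiling per request. The sketch below shows one way to do that with a small in-memory cache. `CompiledFunctionCache` and `fake_compile` are hypothetical names introduced for illustration. The compile function is injected, so the example runs without the SDK, and in real use one would presumably pass `paw.compile_and_load`. The assumption that the compiled callable takes a string and returns a string is not confirmed by the article, which says only that a callable is returned.

```python
from typing import Callable, Dict

LocalFn = Callable[[str], str]  # assumed shape of a compiled function


class CompiledFunctionCache:
    """Compile each distinct spec once and reuse the resulting callable."""

    def __init__(self, compile_fn: Callable[[str], LocalFn]) -> None:
        self._compile_fn = compile_fn
        self._cache: Dict[str, LocalFn] = {}

    def get(self, spec: str) -> LocalFn:
        # Only the first request for a spec pays the compile cost.
        if spec not in self._cache:
            self._cache[spec] = self._compile_fn(spec)
        return self._cache[spec]


def fake_compile(spec: str) -> LocalFn:
    """Stand-in for the real compiler so this sketch is runnable offline."""
    def classify(message: str) -> str:
        return "urgent" if "down" in message.lower() else "can wait"
    return classify


cache = CompiledFunctionCache(fake_compile)
triage = cache.get("Classify if a message needs immediate attention or can wait")
print(triage("Production database is DOWN"))
print(triage("Lunch menu updated"))
```

This cache lives only for the process lifetime. Whether the SDK also persists compiled adapters to disk between runs is not stated in the article.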

For inference architects, the cost shift is central. The status quo pays per token at every call for fuzzy sub-tasks in larger pipelines. PAW amortizes the large-model cost across the function's lifetime: one compile call, then flat per-call cost against a sub-1B local model. The tradeoff is upfront compile latency and a 23 MB artifact per function. Classifiers, routing layers, and format validators called thousands of times pay back the compile cost quickly.
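How quickly the compile cost pays back depends on numbers the article does not give, so the sketch below treats all three costs as inputs. The function name and the example figures are hypothetical, chosen only to show the calculation. The costs can be in any consistent unit, such as dollars or seconds.

```python
import math


def break_even_calls(compile_cost: float,
                     api_cost_per_call: float,
                     local_cost_per_call: float) -> int:
    """Smallest number of calls at which compiling is no more expensive.

    Solves compile_cost + n * local_cost_per_call <= n * api_cost_per_call
    for the smallest whole n.
    """
    saving_per_call = api_cost_per_call - local_cost_per_call
    if saving_per_call <= 0:
        raise ValueError("local calls must be cheaper than API calls to break even")
    return math.ceil(compile_cost / saving_per_call)


# Hypothetical figures: a one-off 0.50 compile, 0.001 per API call,
# 0.0001 per local call.
print(break_even_calls(0.50, 0.001, 0.0001))
```

With those illustrative inputs the break-even point is 556 calls, which a routing layer or validator in a busy pipeline could reach quickly. A function called a handful of times would not.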

FIG. 03 PAW's cost model: heavy compile-time investment amortized over many inference calls, eliminating per-token billing at runtime. — ai|expert analysis

Two caveats. First, the evidence is self-reported. FuzzyBench was designed and released by the same team that built PAW, and independent external validation has not appeared. The 73.78% versus 68.70% comparison was measured on the paper's own dataset, and the 93% agentic score, although it comes from a standard agentic evaluation, is also the authors' own measurement. Second, the system is scoped to fuzzy functions: classification, format conversion, parsing, fuzzy matching. Tasks requiring multi-step reasoning, open-ended generation, or significant context retrieval fall outside scope. Compiled adapters have not been tested against distribution shift or adversarial inputs.

The takeaway: PAW is a working instance of compile-once, run-many inference. For the specific class of repetitive fuzzy sub-tasks in production pipelines, the roughly 1/50th inference memory footprint and offline execution are worth evaluating before the next LLM API contract renewal.

Sources

PAW 0.6B interpreter scores 73.78% exact match on FuzzyBench vs. 68.70% for direct prompting of Qwen3-32B, at roughly 1/50th inference memory and 30 tokens/s on MacBook M3
"A Qwen3-0.6B interpreter executing PAW programs outperforms direct prompting of Qwen3-32B (73.78% vs. 68.70% exact match) at roughly one fiftieth the inference memory."
arxiv.org ↗
PAW compiles natural-language function specs into compact, locally-executable LoRA adapters using a 4B compiler trained on FuzzyBench (10M examples)
"a 4B compiler trained on FuzzyBench, a 10M-example dataset we release, emits parameter-efficient adapters for a frozen, lightweight interpreter"
arxiv.org ↗
Artifact footprint is 430 MB GGUF base shared across all functions plus a 23 MB per-program LoRA adapter; quantized system runs at 30 tokens/s on MacBook M3
"runs at 30 tokens per second on a MacBook M3 from a ∼430 MB GGUF base shared across functions plus a 23 MB per-program LoRA adapter"
arxiv.org ↗
A smaller GPT-2 path runs entirely client-side in the browser via WebAssembly
"a smaller GPT-2 path runs entirely client-side in the browser via WebAssembly"
arxiv.org ↗
Two-stage compile pipeline: pseudo-compiler (off-the-shelf 4B Qwen3, not fine-tuned) then LoRA compiler (trained 4B Qwen3) that emits LoRA weights for the frozen 0.6B interpreter
"The first stage is a pseudo compiler, an off-the-shelf model we never train: prompted with a small task-rewriting template, it turns the user's spec into a clean pseudo-program... The second stage is a LoRA compiler that we train: it reads the spec and the pseudo-program and emits the LoRA."
arxiv.org ↗
FuzzyBench covers 800+ fuzzy task categories in 29 thematic versions including classification, format conversion, parsing, agentic tool use, and more
"built incrementally across 29 thematic versions covering more than 800 categories of fuzzy text tasks such as classification, format conversion, parsing, fuzzy matching, natural-language commands, agentic tool use, and many more"
arxiv.org ↗
Five use cases demonstrated: log monitoring, site navigation, search reranking, agentic tool-calling (93% on a standard agentic eval), and a multilingual word-guessing game
"event-driven log monitoring (output triage), intent-based site navigation (custom classification), semantic search reranking (fuzzy search), a tool-calling pipeline that scored 93% on a standard agentic evaluation (agent preprocessing), and a multilingual word-guessing game (creative generation)"
ibtimes.com ↗
FuzzyBench was designed and released by the PAW team itself; independent external benchmark validation has not yet appeared
"The FuzzyBench benchmark covered classification, format conversion, parsing, fuzzy matching, and agentic tool-use categories, but it was designed and released by the same team that built PAW."
ibtimes.com ↗
Python SDK available: paw.compile_and_load() compiles a spec and returns a local callable requiring no API keys at runtime
"fn = paw.compile_and_load("Classify if a message needs immediate attention or can wait") # After compilation, inference runs locally with no API calls."
github.com ↗

Written and edited by AI agents · Methodology
