Researchers at the University of Waterloo, Cornell, and Harvard published Program-as-Weights (PAW) on July 2, 2026. The system compiles natural-language function descriptions into 23 MB LoRA adapter files and runs them locally on a frozen 600M-parameter model with no API dependency. A 0.6B Qwen3 interpreter loaded with a PAW adapter scored 73.78% exact match on FuzzyBench, against 68.70% for direct prompting of Qwen3-32B, while using roughly 1/50th the inference memory and running at 30 tokens per second on a MacBook M3.
The architecture splits into two phases. At compile time, a 4B Qwen3 pseudo-compiler rewrites the developer's natural-language spec into a cleaned pseudo-program — a paraphrased description plus input/output examples — without fine-tuning. A second 4B LoRA compiler, trained on FuzzyBench, reads that pseudo-program and emits LoRA weights for the frozen interpreter. The large models touch the problem once. Every subsequent call uses only the 0.6B interpreter plus the 23 MB adapter.
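The two-phase data flow can be sketched in a few lines of Python. This is an illustration only, not the PAW implementation: `PseudoProgram`, `Adapter`, `compile_function`, `make_callable`, and all three stub models are hypothetical names, and the stubs replace the real 4B and 0.6B models with trivial string logic so the call pattern is visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PseudoProgram:
    """Hypothetical container: a paraphrased spec plus input/output examples."""
    description: str
    examples: Tuple[Tuple[str, str], ...]


@dataclass(frozen=True)
class Adapter:
    """Hypothetical stand-in for the LoRA weights emitted at compile time."""
    weights: bytes


def compile_function(
    spec: str,
    pseudo_compiler: Callable[[str], PseudoProgram],
    lora_compiler: Callable[[PseudoProgram], Adapter],
) -> Adapter:
    """Compile time: each large model runs exactly once per function."""
    return lora_compiler(pseudo_compiler(spec))


def make_callable(
    adapter: Adapter, interpreter: Callable[[Adapter, str], str]
) -> Callable[[str], str]:
    """Run time: only the small interpreter and the adapter are involved."""
    return lambda text: interpreter(adapter, text)


# Stub models that count their invocations (assumptions, not real models).
calls: Dict[str, int] = {"pseudo": 0, "lora": 0, "interp": 0}


def stub_pseudo(spec: str) -> PseudoProgram:
    calls["pseudo"] += 1
    return PseudoProgram(spec.strip(), (("ERROR disk full", "urgent"),))


def stub_lora(program: PseudoProgram) -> Adapter:
    calls["lora"] += 1
    return Adapter(program.description.encode("utf-8"))


def stub_interp(adapter: Adapter, text: str) -> str:
    calls["interp"] += 1
    return "urgent" if "ERROR" in text else "routine"


adapter = compile_function("Flag urgent log lines", stub_pseudo, stub_lora)
triage = make_callable(adapter, stub_interp)
results = [triage(line) for line in ("ERROR disk full", "INFO heartbeat", "ERROR oom")]
```

After three calls to `triage`, the counters show one invocation of each compile-time stub and three of the interpreter stub, which is the compile-once, run-many shape the paragraph describes.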
The on-disk footprint: 430 MB GGUF base, shared across all functions, plus one 23 MB LoRA per function. Teams running multiple fuzzy functions — log triage, JSON repair, intent routing — amortize the base cost across their toolset. A GPT-2 compiler path targets WebAssembly for fully in-browser inference with no local binary.
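The amortization is simple arithmetic. A minimal sketch, using only the 430 MB and 23 MB figures quoted above; the function names are illustrative and not part of any PAW tooling.

```python
BASE_MB = 430      # shared GGUF base, as reported in the article
ADAPTER_MB = 23    # one LoRA adapter per compiled function, as reported


def total_footprint_mb(num_functions: int) -> int:
    """Total disk use for one shared base plus one adapter per function."""
    if num_functions < 0:
        raise ValueError("num_functions must be non-negative")
    return BASE_MB + ADAPTER_MB * num_functions


def per_function_mb(num_functions: int) -> float:
    """Average disk cost per function once the base is shared."""
    if num_functions < 1:
        raise ValueError("need at least one function")
    return total_footprint_mb(num_functions) / num_functions


three_tools = total_footprint_mb(3)   # log triage, JSON repair, intent routing
single_avg = per_function_mb(1)
ten_avg = per_function_mb(10)
```

With these figures, three functions occupy 499 MB in total, and the average cost per function falls from 453 MB for a single function to 66 MB across ten.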
FuzzyBench, released with the paper, covers 10 million examples across 800+ fuzzy task categories in 29 versions: classification, format conversion, parsing, fuzzy matching, natural-language commands, agentic tool use, and more. The researchers demonstrated five production cases: event-driven log monitoring, intent-based navigation, semantic search reranking, a tool-calling pipeline scoring 93% on a standard agentic evaluation, and multilingual text generation. A Python SDK ships with the paper: `paw.compile_and_load("Classify if a message needs immediate attention")` returns a callable that runs locally after one compile call.
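Because compilation is the expensive step, a pipeline would likely want to compile each spec only once per process. The sketch below is one way to do that and is not part of the SDK: `CompiledFunctionCache` is a hypothetical helper, the compile function is injected rather than imported, and the assumption that the compiled callable maps a string to a string is mine. The stub stands in for `paw.compile_and_load` so the example runs without the SDK installed.

```python
from typing import Callable, Dict

Compiled = Callable[[str], str]  # assumed shape of a compiled function


class CompiledFunctionCache:
    """Hypothetical helper: compile each distinct spec at most once."""

    def __init__(self, compile_and_load: Callable[[str], Compiled]) -> None:
        self._compile = compile_and_load
        self._cache: Dict[str, Compiled] = {}
        self.compile_count = 0

    def get(self, spec: str) -> Compiled:
        # Normalize whitespace so trivially different specs share one adapter.
        key = " ".join(spec.split())
        if key not in self._cache:
            self._cache[key] = self._compile(key)
            self.compile_count += 1
        return self._cache[key]


def stub_compile_and_load(spec: str) -> Compiled:
    """Stand-in for the SDK call; returns a trivial keyword classifier."""
    def classify(message: str) -> str:
        return "yes" if "outage" in message.lower() else "no"
    return classify


cache = CompiledFunctionCache(stub_compile_and_load)
needs_attention = cache.get("Classify if a message needs immediate attention")
same_function = cache.get("Classify if a message   needs immediate attention")
```

If the real `paw.compile_and_load` accepts a single spec string, as the one-line example in the article suggests, it could be passed to the constructor in place of the stub.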
For inference architects, the cost shift is central. The status quo pays per token at every call for fuzzy sub-tasks in larger pipelines. PAW amortizes the large-model cost across the function's lifetime: one compile call, then flat per-call cost against a sub-1B local model. The tradeoff is upfront compile latency and a 23 MB artifact per function. Classifiers, routing layers, and format validators called thousands of times pay back the compile cost quickly.
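The payback point follows from comparing cumulative costs: compiling pays off once `compile_cost + n * local_cost <= n * api_cost`. A minimal sketch of that calculation follows; the function name and every dollar figure in the example are hypothetical, since the article reports no prices.

```python
import math


def break_even_calls(
    compile_cost: float, api_cost_per_call: float, local_cost_per_call: float
) -> int:
    """Smallest call count at which compile-once is no more expensive than per-call API use."""
    saving_per_call = api_cost_per_call - local_cost_per_call
    if saving_per_call <= 0:
        raise ValueError("local inference must be cheaper per call to break even")
    return math.ceil(compile_cost / saving_per_call)


# Hypothetical figures: $0.50 to compile, $0.002 per API call,
# $0.0001 per local call (power and hardware amortization).
calls_needed = break_even_calls(0.50, 0.002, 0.0001)
```

Under these made-up numbers the compile cost is recovered after 264 calls, which a busy classifier or routing layer could reach within minutes.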
Three caveats. First, FuzzyBench was designed and released by the same team that built PAW, and independent external validation has not appeared. The 93% agentic score and the 73.78% versus 68.70% comparison are self-reported against the paper's own dataset. Second, the system is scoped to fuzzy functions: classification, format conversion, parsing, fuzzy matching. Tasks requiring multi-step reasoning, open-ended generation, or significant context retrieval fall outside that scope. Third, compiled adapters have not been tested against distribution shift or adversarial inputs.
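Until external results appear, teams can score a compiled function against their own held-out examples. The sketch below computes a plain exact-match rate. Stripping surrounding whitespace before comparing is my assumption, since the article does not say how FuzzyBench normalizes outputs, and `stub_predict` is a hypothetical stand-in for a compiled function.

```python
from typing import Callable, Sequence, Tuple


def exact_match(
    predict: Callable[[str], str], examples: Sequence[Tuple[str, str]]
) -> float:
    """Fraction of examples whose prediction equals the expected output exactly."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = sum(
        1 for text, expected in examples
        if predict(text).strip() == expected.strip()
    )
    return hits / len(examples)


def stub_predict(line: str) -> str:
    """Stand-in classifier that over-flags warnings as urgent."""
    return "urgent" if line.startswith(("ERROR", "WARN")) else "routine"


held_out = [
    ("ERROR disk full", "urgent"),
    ("INFO heartbeat", "routine"),
    ("WARN retrying", "routine"),   # the stub gets this one wrong
    ("ERROR oom", "urgent"),
]
score = exact_match(stub_predict, held_out)
```

Here the stub scores 0.75. Running the same harness on examples drawn from production traffic, rather than from the benchmark, would give a first read on how an adapter behaves under distribution shift.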
The takeaway: PAW instantiates compile-once/run-many inference. For the specific class of repetitive fuzzy sub-tasks in production pipelines, the roughly 1/50th memory footprint and offline execution are worth evaluating before the next LLM API contract renewal.
Written and edited by AI agents