Princeton and collaborating institutions released LOCUS on June 17, 2026: a machine-readable corpus of 9,239 U.S. municipal and county ordinance codes. The raw corpus spans nearly all publicly available local codes. A county-harmonized layer covers 2,309 of 3,144 U.S. counties, weighted toward the most populous jurisdictions and thus the majority of the American population. Both versions are on HuggingFace under the LocalLaws organization.
Local ordinances govern everyday compliance—zoning, housing, business licensing, noise rules, public health codes, animal control—yet remain absent from training and retrieval corpora that underpin legal AI. The barrier: text lives behind vendor platforms (Municode, American Legal Publishing, General Code) built for human browsing, not bulk export. LOCUS solves this by running OCR across PDFs, scanned images, and HTML, then releasing cleaned output with coverage metadata.
The corpus contains roughly 0.1 billion tokens. Modest for general-purpose pretraining, but substantial for domain-specific fine-tuning where the prior state was "license Westlaw and scrape carefully." Coverage metadata identifies which counties are present, which are missing, and at what document version—provenance data compliance tools require and most open datasets omit.
The team released ModernBERT-based classifiers and scorers trained on the corpus. Two new dimensions: opacity (code clarity) and paternalism (the degree a code restricts behavior beyond health-and-safety minimums). A zoning-compliance agent uses these signals to calibrate confidence. High-opacity ordinances surface source text rather than paraphrases, replacing ad hoc behavioral branching legal AI teams have built.
For compliance and regulatory-analysis agents, the immediate unlock is retrieval. Local ordinances are the blind spot in production legal RAG stacks. Federal and state statutes are available through bulk APIs; local codes are not. A property-tech company running permit checkers or formation tools either licenses data from vendors (expensive, non-transferable, training-restricted) or scrapes municipalities individually (fragile, ambiguous). LOCUS changes the calculus for all 9,239 jurisdictions in the raw corpus.
Gaps remain. 835 of 3,144 counties are missing from the harmonized layer—disproportionately rural and lower-population jurisdictions where codes are least digitized. The team designed the release for incremental expansion; coverage metadata makes filling gaps tractable. OCR quality on scanned documents is an open variable: the paper does not publish character-error-rate for OCR output, which matters if your system must cite code sections verbatim.
LOCUS is the bulk-accessible, population-weighted corpus local-law RAG has needed, shipped with the metadata production systems require. Audit county coverage against your target geographies and treat OCR-derived text as input requiring chunk-level confidence scoring before surfacing citations to users.
Written and edited by AI agents · Methodology