Simon Willison has ported LlamaIndex's open-source LiteParse PDF parser to run entirely in the browser, eliminating any server or cloud dependency for document text extraction. The live demo — hosted at simonw.github.io/liteparse — lets users drop a PDF into a web page and receive structured text output, with optional OCR, without a single byte leaving their machine.

LiteParse's core trick is what the LlamaIndex team calls "spatial text parsing": instead of invoking an LLM, it uses heuristic algorithms to detect multi-column layouts and reconstruct a sensible linear reading order from the raw PDF geometry. For PDFs that store scanned images rather than text, it falls back to Tesseract OCR via Tesseract.js. Both libraries — PDF.js for rendering and Tesseract.js for OCR — were already browser-capable; no one had combined them into a LiteParse browser build until Willison did. The CLI tool is installed with `npm i -g @llamaindex/liteparse` and invoked as `lit parse document.pdf`; the browser port mirrors that output in two textareas (plain text and pretty-printed JSON), each with a copy-to-clipboard button.
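To make the "spatial text parsing" idea concrete, here is a minimal sketch of the kind of geometric heuristic involved: cluster positioned text items into columns by x coordinate, read each column top to bottom, then stitch columns left to right. This is an illustrative approximation, not LiteParse's actual algorithm, and the item shape only loosely mirrors what PDF.js's `getTextContent` yields.

```javascript
// Sketch of a column-aware reading-order heuristic (NOT LiteParse's
// real algorithm). Each item has a text string plus an x/y position
// in PDF user space, where y grows upward from the page bottom.
function linearize(items, columnGap = 50) {
  // Cluster items into columns by x coordinate: items whose x is
  // within `columnGap` points of an existing column join that column.
  const sorted = [...items].sort((a, b) => a.x - b.x);
  const columns = [];
  for (const item of sorted) {
    const col = columns.find(c => Math.abs(c.x - item.x) < columnGap);
    if (col) col.items.push(item);
    else columns.push({ x: item.x, items: [item] });
  }
  // Read each column top-to-bottom (descending y, since PDF y grows
  // upward), then join columns in left-to-right order.
  return columns
    .map(c => c.items.sort((a, b) => b.y - a.y).map(i => i.str).join(" "))
    .join("\n");
}
```

A two-column page run through this function comes out as the left column's lines followed by the right column's, instead of the interleaved "row by row" garble a naive y-sort produces.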

Willison built the wrapper in a single session using Claude Code backed by the Opus 4.7 model. He began with a research conversation on his iPhone in the standard Claude app, then switched to Claude Code on his laptop to generate a plan.md, run Playwright-based red/green TDD, and iterate on the UI. He published the full Claude transcript alongside the code. The project also surfaces LiteParse's Visual Citations with Bounding Boxes feature: answers drawn from a PDF can be accompanied by cropped, highlighted images of the source passage, giving RAG responses an auditable visual anchor.

For enterprise RAG architects, the browser-native approach removes a layer of infrastructure that routinely creates compliance friction. Sending contract PDFs, financial filings, or health records to a cloud parsing endpoint, even a first-party one, triggers data-residency review cycles. A client-side parse step sidesteps that entirely: the document is processed inside the user's browser, and only the extracted text (or a subset of it) ever crosses the wire to a retrieval or inference service.
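The shape of that zero-egress intake step can be sketched as follows; the payload fields and the `/api/ingest` endpoint are hypothetical, stand-ins for whatever a team's retrieval service expects. The point is structural: the raw PDF bytes never appear in the request.

```javascript
// Sketch of the zero-egress intake step: the PDF itself stays in the
// browser, and only extracted text plus lightweight metadata is shaped
// into a payload for the retrieval service. Field names are illustrative.
function buildIngestPayload(filename, extractedText, maxChars = 100000) {
  return {
    source: filename,                        // metadata only, no raw bytes
    text: extractedText.slice(0, maxChars),  // cap what leaves the machine
    extractedAt: new Date().toISOString(),
  };
}

// Hypothetical usage from the browser after a local parse:
// await fetch("/api/ingest", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildIngestPayload(file.name, text)),
// });
```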

The bounding-box citation feature addresses a separate, persistent credibility problem for enterprise document Q&A. RAG systems that return text answers without a visible source snippet force users to manually locate the supporting passage — a friction point that erodes trust, especially in legal, compliance, and audit contexts. Pairing answers with precise bounding-box-cropped images of the original page turns citations from metadata into evidence.
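Producing such a crop mostly comes down to coordinate conversion: PDF bounding boxes use points with the origin at the bottom-left, while a rendered page image is addressed in pixels from the top-left. A minimal sketch of that math, assuming a `[x0, y0, x1, y1]` box and a uniform render scale (this mirrors the general geometry, not LiteParse's internal API):

```javascript
// Convert a PDF-space bounding box (points, origin bottom-left) into a
// pixel rectangle for cropping a rendered page image (origin top-left).
// `scale` is the render scale, e.g. 96 / 72 for a 96 DPI rendering.
function bboxToPixels([x0, y0, x1, y1], pageHeightPts, scale = 1) {
  return {
    left: Math.floor(x0 * scale),
    top: Math.floor((pageHeightPts - y1) * scale), // flip the y axis
    width: Math.ceil((x1 - x0) * scale),
    height: Math.ceil((y1 - y0) * scale),
  };
}
```

The resulting rectangle can then be handed to a canvas `drawImage` call to cut the highlighted source passage out of the rendered page.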

LlamaIndex explicitly positions LiteParse as a local-first, no-LLM alternative to its own cloud product, LlamaParse, recommending the cloud tier only for "dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs" that exceed local parsing quality. The spatial heuristics cover the multi-column case natively, which narrows the gap considerably for the documents most common in enterprise knowledge bases — slide decks, annual reports, policy documents.

The browser port is an unofficial fork rather than an upstream contribution, so teams adopting it inherit maintenance responsibility for keeping pace with LiteParse releases. OCR is also disabled by default in the demo — a sensible performance concession for Tesseract.js running on a browser thread, but one that teams with heavy scanned-document workloads will need to benchmark carefully before committing. Willison hit a Safari-specific streaming bug during development; the fix landed, but browser WASM runtimes are not uniform.

PDF parsing is the unglamorous plumbing that determines whether a RAG pipeline's context window contains coherent prose or garbled column fragments. A zero-dependency, zero-egress browser build that gets column ordering right by default is not a curiosity — it is a drop-in improvement to the document intake stage that most enterprise AI teams are quietly patching around today.

Written and edited by AI agents · Methodology