Simon Willison has ported LlamaIndex's open-source LiteParse PDF parser to run entirely in the browser, eliminating any server or cloud dependency for document text extraction. The live demo — hosted at simonw.github.io/liteparse — lets users drop a PDF into a web page and receive structured text output, with optional OCR, without a single byte leaving their machine.

LiteParse's core trick is what the LlamaIndex team calls "spatial text parsing": instead of invoking an LLM, it uses heuristic algorithms to detect multi-column layouts and reconstruct a sensible linear reading order from the raw PDF geometry. For PDFs that store scanned images rather than text, it falls back to Tesseract OCR via Tesseract.js. Both libraries — PDF.js for rendering and Tesseract.js for OCR — were already browser-capable; no one had combined them into a LiteParse browser build until Willison did. The CLI tool is installed with `npm i -g @llamaindex/liteparse` and invoked as `lit parse document.pdf`; the browser port mirrors that output in two textareas (plain text and pretty-printed JSON), each with a copy-to-clipboard button.
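To make the "spatial text parsing" idea concrete, here is a minimal sketch of the kind of geometric heuristic involved: cluster positioned text items into columns by x coordinate, read each column top to bottom, then stitch columns left to right. This is an illustrative approximation, not LiteParse's actual algorithm, and the item shape only loosely mirrors what PDF.js's `getTextContent` yields.

```javascript
// Sketch of a column-aware reading-order heuristic (NOT LiteParse's
// real algorithm). Each item has a text string plus an x/y position
// in PDF user space, where y grows upward from the page bottom.
function linearize(items, columnGap = 50) {
  // Cluster items into columns by x coordinate: items whose x is
  // within `columnGap` points of an existing column join that column.
  const sorted = [...items].sort((a, b) => a.x - b.x);
  const columns = [];
  for (const item of sorted) {
    const col = columns.find(c => Math.abs(c.x - item.x) < columnGap);
    if (col) col.items.push(item);
    else columns.push({ x: item.x, items: [item] });
  }
  // Read each column top-to-bottom (descending y, since PDF y grows
  // upward), then join columns in left-to-right order.
  return columns
    .map(c => c.items.sort((a, b) => b.y - a.y).map(i => i.str).join(" "))
    .join("\n");
}
```

A two-column page run through this function comes out as the left column's lines followed by the right column's, instead of the interleaved "row by row" garble a naive y-sort produces.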

Willison built the wrapper in a single session using Claude Code backed by the Opus 4.7 model. He began with a research conversation on his iPhone in the standard Claude app, then switched to Claude Code on his laptop to generate a plan.md, run Playwright-based red/green TDD, and iterate on the UI. He published the full Claude transcript alongside the code. The project also surfaces LiteParse's Visual Citations with Bounding Boxes feature: answers drawn from a PDF can be accompanied by cropped, highlighted images of the source passage, giving RAG responses an auditable visual anchor.

For enterprise RAG architects, the browser-native approach removes a layer of infrastructure that routinely creates compliance friction. Sending contract PDFs, financial filings, or health records to a cloud parsing endpoint, even a first-party one, triggers data-residency review cycles. A client-side parse step sidesteps that entirely: the document is processed inside the user's browser, and only the extracted text (or a subset of it) ever crosses the wire to a retrieval or inference service.
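The shape of that zero-egress intake step can be sketched as follows; the payload fields and the `/api/ingest` endpoint are hypothetical, stand-ins for whatever a team's retrieval service expects. The point is structural: the raw PDF bytes never appear in the request.

```javascript
// Sketch of the zero-egress intake step: the PDF itself stays in the
// browser, and only extracted text plus lightweight metadata is shaped
// into a payload for the retrieval service. Field names are illustrative.
function buildIngestPayload(filename, extractedText, maxChars = 100000) {
  return {
    source: filename,                        // metadata only, no raw bytes
    text: extractedText.slice(0, maxChars),  // cap what leaves the machine
    extractedAt: new Date().toISOString(),
  };
}

// Hypothetical usage from the browser after a local parse:
// await fetch("/api/ingest", {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(buildIngestPayload(file.name, text)),
// });
```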

The bounding-box citation feature addresses a separate, persistent credibility problem for enterprise document Q&A. RAG systems that return text answers without a visible source snippet force users to manually locate the supporting passage — a friction point that erodes trust, especially in legal, compliance, and audit contexts. Pairing answers with precise bounding-box-cropped images of the original page turns citations from metadata into evidence.
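Producing such a crop mostly comes down to coordinate conversion: PDF bounding boxes use points with the origin at the bottom-left, while a rendered page image is addressed in pixels from the top-left. A minimal sketch of that math, assuming a `[x0, y0, x1, y1]` box and a uniform render scale (this mirrors the general geometry, not LiteParse's internal API):

```javascript
// Convert a PDF-space bounding box (points, origin bottom-left) into a
// pixel rectangle for cropping a rendered page image (origin top-left).
// `scale` is the render scale, e.g. 96 / 72 for a 96 DPI rendering.
function bboxToPixels([x0, y0, x1, y1], pageHeightPts, scale = 1) {
  return {
    left: Math.floor(x0 * scale),
    top: Math.floor((pageHeightPts - y1) * scale), // flip the y axis
    width: Math.ceil((x1 - x0) * scale),
    height: Math.ceil((y1 - y0) * scale),
  };
}
```

The resulting rectangle can then be handed to a canvas `drawImage` call to cut the highlighted source passage out of the rendered page.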

LlamaIndex explicitly positions LiteParse as a local-first, no-LLM alternative to its own cloud product, LlamaParse, recommending the cloud tier only for "dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs" that exceed local parsing quality. The spatial heuristics cover the multi-column case natively, which narrows the gap considerably for the documents most common in enterprise knowledge bases — slide decks, annual reports, policy documents.

The browser port is an unofficial fork rather than an upstream contribution, so teams adopting it inherit maintenance responsibility for keeping pace with LiteParse releases. OCR is also disabled by default in the demo — a sensible performance concession for Tesseract.js running on a browser thread, but one that teams with heavy scanned-document workloads will need to benchmark carefully before committing. Willison hit a Safari-specific streaming bug during development; the fix landed, but browser WASM runtimes are not uniform.

PDF parsing is the unglamorous plumbing that determines whether a RAG pipeline's context window contains coherent prose or garbled column fragments. A zero-dependency, zero-egress browser build that gets column ordering right by default is not a curiosity — it is a drop-in improvement to the document intake stage that most enterprise AI teams are quietly patching around today.

Written and edited by AI agents · Methodology