The Deutsche Bundesbank replaced a brittle Named Entity Recognition (NER) pipeline with a generative LLM stack for securities collateral eligibility screening, publishing the first case study of LLM-based regulatory examination at a central bank. The paper, co-authored by Bundesbank-affiliated researchers at Anhalt University of Applied Sciences and posted June 25, 2026, reports 91% precision on document-level eligibility decisions. The system is tuned to err toward rejecting a valid security rather than accepting an ineligible one.

Under ECB Eurosystem rules, every Bundesbank credit transaction requires eligible collateral backing. Six criteria must all hold: currency (EUR, USD, GBP, or JPY), instrument type, fixed principal, full redemption at maturity, permitted coupon structure, and non-subordinated status. Thousands of securities issue annually as PDF prospectuses that run hundreds of pages, are semi-structured, and frequently interleave German and English in parallel columns.
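
The all-six-must-hold rule is a plain conjunction, which a minimal sketch can make concrete. The class and field names below are illustrative assumptions, not identifiers from the paper; only the currency list comes from the text above.

```python
from dataclasses import dataclass, fields
from typing import List

# Currencies named in the article.
ELIGIBLE_CURRENCIES = {"EUR", "USD", "GBP", "JPY"}


@dataclass(frozen=True)
class CriteriaResult:
    """One boolean per criterion. Field names are hypothetical."""
    currency_ok: bool
    instrument_type_ok: bool
    fixed_principal: bool
    full_redemption_at_maturity: bool
    coupon_structure_ok: bool
    non_subordinated: bool


def currency_ok(code: str) -> bool:
    """Check a currency code against the eligible set, ignoring case and padding."""
    return code.strip().upper() in ELIGIBLE_CURRENCIES


def is_eligible(result: CriteriaResult) -> bool:
    """A security passes only if every single criterion holds."""
    return all(getattr(result, f.name) for f in fields(result))


def failed_criteria(result: CriteriaResult) -> List[str]:
    """Names of the criteria that did not hold, in declaration order."""
    return [f.name for f in fields(result) if not getattr(result, f.name)]
```

A single failed field is enough to reject the document, and `failed_criteria` gives an analyst the reason without rereading the prospectus.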

The old Transformer-based NER system worked on clean text but had three production liabilities: it required manual annotation for each new entity type, its span boundaries broke under OCR artifacts, and it had no mechanism for handling language switches. Garbled German-English OCR input degraded its precision.

The new pipeline extracts, normalizes, and interprets in a zero-shot setting, with no fine-tuning. Inference runs on Llama-3.3-70B-Instruct and Cohere Command-R 08-2024. A separate Mistral Small 3.1 instance acts as judge. The evaluation replaces location-based span matching with LLM-as-a-judge scoring, which grades semantic correctness rather than token overlap and is therefore resistant to the OCR noise that broke the NER system.
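
The extract, normalize, and judge stages can be sketched with the model behind a plain callable. This is a rough illustration under stated assumptions: the prompts, function names, and alias table are invented here, and `LLM` stands in for whatever client wraps the actual models. It is not the paper's code.

```python
from typing import Callable, Dict

# Prompt in, completion out. A stand-in for any real model client (assumption).
LLM = Callable[[str], str]


def extract(llm: LLM, criterion: str, passage: str) -> str:
    """Zero-shot extraction: ask for one value, with no examples or fine-tuning."""
    prompt = (
        f"Extract the {criterion} from the prospectus text below. "
        f"Answer with the value only.\n\n{passage}"
    )
    return llm(prompt).strip()


def normalize(raw: str, aliases: Dict[str, str]) -> str:
    """Map free-text answers to canonical codes; unknown values are upper-cased."""
    key = " ".join(raw.lower().split())
    return aliases.get(key, raw.strip().upper())


def judge(judge_llm: LLM, criterion: str, predicted: str, reference: str) -> bool:
    """Ask a second model whether prediction and reference mean the same thing."""
    prompt = (
        f"Criterion: {criterion}\nPredicted: {predicted}\nReference: {reference}\n"
        "Do these mean the same thing? Answer YES or NO."
    )
    return judge_llm(prompt).strip().upper().startswith("YES")
```

Because the judge compares meanings rather than character offsets, a prediction such as "Euro" against a reference of "EUR" can still be scored as correct, which span matching would miss.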

FIG. 02 LLM-based pipeline for collateral eligibility: extract, normalize, and judge across six criteria in two tiers.

The six criteria split into two tiers. The first four ("simple") extract one entity per criterion: currency, instrument class, principal structure, redemption terms. The last two ("complex")—coupon structure and subordination status—require decision trees across multiple extracted entities plus external master data. The decision-tree layer sits on top of the generative extraction stage, not inside it.
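
To show what a decision tree on top of extraction might look like, here is a hedged sketch for the coupon criterion. The branch conditions, coupon categories, and the `permitted_rates` master-data set are placeholders made up for illustration; they are neither the Eurosystem rules nor the trees used in the paper.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class CouponEntities:
    """Entities an upstream extraction step might return (hypothetical fields)."""
    coupon_type: Optional[str]     # e.g. "fixed", "floating", "zero"; None if not found
    reference_rate: Optional[str]  # only meaningful for floating coupons


def coupon_permitted(entities: CouponEntities, permitted_rates: Set[str]) -> bool:
    """Walk a small rule tree over extracted entities plus external master data."""
    if entities.coupon_type is None:
        # Conservative default: a missing extraction is treated as a rejection.
        return False
    if entities.coupon_type in {"fixed", "zero"}:
        return True
    if entities.coupon_type == "floating":
        # Floating coupons need a reference rate found in the master data.
        return (
            entities.reference_rate is not None
            and entities.reference_rate in permitted_rates
        )
    # Any coupon type the tree does not recognise is rejected.
    return False
```

Keeping this logic in ordinary code means each rejection traces to a specific branch, which is harder to guarantee when the model makes the final call itself.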

The 91% figure covers document-level binary eligibility: does the prospectus pass all six criteria? The system operates conservatively, tuned to minimize false acceptance. Errors trend toward incorrectly rejecting a valid security rather than passing an ineligible one—deliberate in collateral management, where a false negative wastes analyst time while a false positive exposes the central bank to financial risk.
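
A short calculation shows why a precision figure alone says little about how many valid securities get turned away. The counts below are invented purely for illustration and are not results from the paper.

```python
def precision(tp: int, fp: int) -> float:
    """Share of accepted securities that were truly eligible."""
    if tp + fp == 0:
        raise ValueError("no positive predictions")
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of truly eligible securities that the system accepted."""
    if tp + fn == 0:
        raise ValueError("no positive ground-truth cases")
    return tp / (tp + fn)


# Hypothetical confusion-matrix counts (assumption, not data from the paper):
# tp = eligible and accepted, fp = ineligible but accepted,
# fn = eligible but rejected.
tp, fp, fn = 45, 5, 25

p = precision(tp, fp)
r = recall(tp, fn)
```

With these made-up counts, precision is 0.90 while recall is roughly 0.64: a conservative system can look strong on false acceptances and still send many valid securities back to analysts.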

Not reported: field-level recall for the complex coupon and subordination criteria, and latency or cost per document at production scale. The LLM-as-a-judge method introduces a second failure point, with Mistral Small 3.1 evaluating Llama-3.3-70B outputs, and calibration between the two is unpublished. Teams adopting this pattern should treat 91% as a ceiling on clean OCR input, not a floor.
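
One way a team might check the judge is to compare its verdicts with human labels on a small sample and compute chance-corrected agreement. The sketch below uses Cohen's kappa as an example metric; the paper does not say it uses this measure, and the function name is ours.

```python
from typing import Sequence


def cohen_kappa(judge: Sequence[bool], human: Sequence[bool]) -> float:
    """Cohen's kappa for two binary raters over the same items."""
    if len(judge) != len(human) or len(judge) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(judge)
    # Observed agreement: fraction of items where both raters gave the same verdict.
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Expected agreement if both raters labelled independently at their own base rates.
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Both raters gave one identical constant label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near 1 suggests the judge tracks human reviewers closely, while a value near 0 means its agreement is no better than chance and its scores should not be trusted as an evaluation signal.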

Zero-shot 70B-class models with structured multi-stage pipelines can replace annotation-heavy NER for high-stakes document extraction. The decision-tree interpretation layer still lives outside the model, and your evaluation harness is itself an LLM you need to validate.

Written and edited by AI agents