Bundesbank Hits 91% Precision on Automated Collateral Eligibility

The Deutsche Bundesbank replaced a brittle Named Entity Recognition (NER) pipeline with a generative LLM stack for securities collateral eligibility screening, publishing the first case study of LLM-based regulatory examination at a central bank. The paper, co-authored by Bundesbank-affiliated researchers at Anhalt University of Applied Sciences and posted June 25, 2026, reports up to 91% precision on document-level eligibility decisions. The system is tuned to avoid false acceptances, at the cost of sometimes rejecting valid securities.

Under ECB Eurosystem rules, every Bundesbank credit transaction requires eligible collateral backing. Six criteria must all hold: currency (EUR, USD, GBP, or JPY), instrument type, fixed principal, full redemption at maturity, permitted coupon structure, and non-subordinated status. Thousands of securities issue annually as PDF prospectuses that run hundreds of pages, are semi-structured, and frequently interleave German and English in parallel columns.

The old Transformer-based NER system worked on clean text but had three production liabilities: it required manual annotation for each new annotation type, its span boundaries broke under OCR artifacts, and it had no mechanism for handling language switching. Garbled German-English OCR input degraded its precision.

The new pipeline extracts, normalizes, and interprets in a zero-shot setting, with no fine-tuning. Inference runs on Llama-3.3-70B-Instruct and Cohere Command-R 08-2024; a separate Mistral Small 3.1 Instruct model acts as judge. The evaluation replaces location-based span matching with an LLM-as-a-judge that scores the semantic correctness of extracted values rather than token overlap, which makes it resistant to the OCR noise that broke the NER system.
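
The difference between location-based and value-based scoring can be sketched in a few lines. This is an illustration only: a small deterministic normalizer stands in for the LLM judge, and the function names, alias table, and character offsets are all hypothetical rather than taken from the paper.

```python
import re

def span_match(pred_span, gold_span):
    """Location-based scoring: character offsets must match exactly."""
    return pred_span == gold_span

def normalize_currency(raw):
    """Hypothetical normalizer: map a noisy surface form to an ISO code.

    Stands in for the LLM judge, which is not reproduced here.
    """
    cleaned = re.sub(r"[^A-Za-z€]", "", raw).upper()
    aliases = {"EUR": "EUR", "EURO": "EUR", "EUROS": "EUR", "€": "EUR"}
    return aliases.get(cleaned)

def value_match(pred_raw, gold_value):
    """Value-based scoring: compare normalized meaning, not position."""
    return normalize_currency(pred_raw) == gold_value

# OCR inserted stray spaces, so the predicted span is two characters longer.
gold_span, pred_span = (120, 123), (120, 125)
print(span_match(pred_span, gold_span))  # offsets differ
print(value_match("E U R", "EUR"))       # same value after normalization
```

The span check fails on a two-character offset shift even though the extracted value is right, which is the failure mode the value-based evaluation is meant to avoid.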

FIG. 02 LLM-based pipeline for collateral eligibility: extract, normalize, and judge across six criteria in two tiers.

The six criteria split into two tiers. The first four ("simple") extract one entity per criterion: currency, instrument class, principal structure, redemption terms. The last two ("complex")—coupon structure and subordination status—require decision trees across multiple extracted entities plus external master data. The decision-tree layer sits on top of the generative extraction stage, not inside it.
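
A minimal sketch of how a rule layer might sit on top of the extraction output is shown below. The field names, the allowed instrument and reference-rate sets, the toy coupon tree, and the master-data layout are all assumptions made for illustration; they are not the Eurosystem rules or the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Everything below is illustrative: field names, allowed values, and the
# master-data layout are assumptions, not the Bundesbank's actual rules.
ALLOWED_CURRENCIES = {"EUR", "USD", "GBP", "JPY"}  # as listed in the article
ALLOWED_INSTRUMENTS = {"bond", "note"}             # placeholder set
PERMITTED_REFERENCE_RATES = {"EURIBOR"}            # placeholder set


@dataclass
class Extraction:
    """Normalized values an upstream LLM extraction stage might emit."""
    security_id: str
    currency: str
    instrument_type: str
    fixed_principal: bool
    full_redemption: bool
    coupon_type: str                      # "fixed", "zero", or "floating"
    reference_rate: Optional[str] = None  # only relevant for floating coupons


def coupon_ok(x: Extraction) -> bool:
    """Toy decision tree over several extracted entities."""
    if x.coupon_type in {"fixed", "zero"}:
        return True
    if x.coupon_type == "floating":
        return x.reference_rate in PERMITTED_REFERENCE_RATES
    return False  # unknown structure: reject conservatively


def status_ok(x: Extraction, master_data: dict) -> bool:
    """Non-subordination check that consults external master data."""
    record = master_data.get(x.security_id)
    if record is None:
        return False  # no master-data record: reject conservatively
    return not record["subordinated"]


def evaluate(x: Extraction, master_data: dict) -> dict:
    """Return per-criterion results plus the all-six-must-hold verdict."""
    results = {
        "currency": x.currency in ALLOWED_CURRENCIES,
        "instrument_type": x.instrument_type in ALLOWED_INSTRUMENTS,
        "fixed_principal": x.fixed_principal,
        "full_redemption": x.full_redemption,
        "coupon": coupon_ok(x),
        "status": status_ok(x, master_data),
    }
    results["eligible"] = all(results.values())
    return results
```

Keeping the tree outside the model means each rejection can be traced to a named criterion, and the rules can change without re-prompting the extraction stage.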

The 91% figure covers document-level binary eligibility: does the prospectus pass all six criteria? The system operates conservatively, tuned to minimize false acceptance. Errors trend toward incorrectly rejecting a valid security rather than passing an ineligible one—deliberate in collateral management, where a false negative wastes analyst time while a false positive exposes the central bank to financial risk.
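
To see why a high precision figure says little about how many valid securities get rejected, it helps to compute precision and recall side by side. The counts below are made up purely for illustration and are not the paper's confusion matrix; only the precision and recall formulas are standard.

```python
def precision(tp: int, fp: int) -> float:
    """Share of accepted documents that were truly eligible."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Share of truly eligible documents that were accepted."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts (NOT from the paper): a conservative system accepts
# few ineligible documents (low fp) but rejects many valid ones (high fn).
tp, fp, fn = 91, 9, 40

print(f"precision = {precision(tp, fp):.2f}")
print(f"recall    = {recall(tp, fn):.2f}")
```

In this made-up case precision is 91/100 while recall is 91/131, so a system can look strong on false acceptances and still send a large share of valid prospectuses back to analysts.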

Not reported: field-level recall for the complex coupon and status criteria, and latency or cost per document at production scale. The LLM-as-a-judge method also introduces a second failure point, since Mistral Small 3.1 is evaluating the outputs of the inference models, and calibration between judge and inference models is unpublished. Teams adopting this pattern should treat 91% as a ceiling on clean OCR input, not a floor.
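
One way a team could check its own judge is to have humans label a sample of extractions and measure chance-corrected agreement with the judge's verdicts. The sketch below uses Cohen's kappa; the choice of metric and the sample labels are assumptions for illustration, not something the paper describes.

```python
def cohens_kappa(a: list, b: list) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Agreement expected if both raters labeled independently at their
    # own base rates.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts: 1 = "extraction correct", 0 = "extraction wrong".
human = [1, 1, 0, 0]
judge = [1, 1, 0, 1]
print(cohens_kappa(human, judge))
```

Raw agreement here is 3 out of 4, but kappa is lower because half of that agreement would be expected by chance given the two raters' base rates.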

The takeaway: zero-shot general-purpose models in a structured multi-stage pipeline can replace annotation-heavy NER for high-stakes document extraction. The decision-tree interpretation layer still lives outside the model, and the evaluation harness is itself an LLM that needs to be validated.

Sources

LLM-based systems achieve up to 91% precision in document-level eligibility determination at the Deutsche Bundesbank
"Our results demonstrate that LLM-based systems achieve high precision (up to 91%) in document-level eligibility, exhibiting a conservative operating profile that minimizes false acceptance."
arxiv.org ↗
The pipeline uses Llama-3.3-70B-Instruct and Cohere Command-R 08-2024 for inference, with Mistral Small 3.1 Instruct as the LLM judge
"our study focuses on the zero-shot and instruction-following capabilities of high-performance general-purpose models: Llama-3.3-70B-Instruct and Cohere Command-R 08-2024 for inference, and Mistral Small 3.1 Instruct for evaluation"
arxiv.org ↗
The task decomposes into six eligibility criteria — currency, instrument type, principal amount, redemption at maturity, coupon structure, and subordination status — all of which must be satisfied
"eligibility is determined by 6 criteria, all of which must be fulfilled for the prospectus to be eligible"
arxiv.org ↗
The prior NER-based system required extensive manual annotation and was fragile under OCR artifacts and rigid span boundaries
"that approach introduced several constraints, primarily: it required extensive manual annotation to provide necessary supervision for all relevant annotation types, and the resulting models were sensitive to the rigid boundaries of text spans (which made them fragile when encountering OCR artifacts or financial language different from its training set)"
arxiv.org ↗
The new pipeline replaces location-based span metrics with a value-based LLM-as-a-judge evaluation resistant to OCR noise
"Introducing a value-based evaluation methodology using LLM-as-a-judge, resistant to OCR noise and linguistic variance"
arxiv.org ↗
Prospectuses are PDF files that can run hundreds of pages, are semi-structured, and frequently bilingual with German and English interleaved or in parallel columns
"Prospectuses can be bilingual, with English or German interleaved or presented in parallel columns, requiring models that are robust to language switching"
arxiv.org ↗

Written and edited by AI agents · Methodology
