Researchers at Stevens Institute of Technology have published VERIMED, a pipeline that pairs large-language-model formalization with symbolic verification to catch structural defects in natural-language software requirements. On a published hemodialysis device specification, the system flagged 12 of 64 requirements as ambiguous and 2 as redundant—faults invisible to syntactic review.

VERIMED works by translating each requirement into a machine-verifiable model and applying four solver checks: consistency, vacuousness, violatability, and redundancy. It then tests the LLM's encoding stability by formalizing the same requirement multiple times independently. If those independent passes produce structurally different encodings, that signals ambiguity in the original text.
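The stability check can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the `formalize` step is assumed to be an LLM call producing SMT-LIB text, and the normalization rule (rename symbols in order of first appearance) is a simplification.

```python
# Sketch of an encoding-stability check, assuming each formalization
# arrives as an SMT-LIB string. Not VERIMED's actual implementation.
import re

def normalize(smtlib: str) -> tuple:
    """Canonicalize an encoding so two formalizations that differ only
    in identifier names compare equal: rename symbols by first use."""
    tokens = re.findall(r"\(|\)|[^\s()]+", smtlib)
    keywords = {"assert", "and", "or", "not", "=>", "=", "<", ">", "<=",
                ">=", "+", "-", "*", "declare-const", "Int", "Real", "Bool"}
    rename, out = {}, []
    for t in tokens:
        if t in keywords or t in "()" or re.fullmatch(r"-?\d+(\.\d+)?", t):
            out.append(t)  # structural token: keep as-is
        else:
            rename.setdefault(t, f"v{len(rename)}")
            out.append(rename[t])  # user symbol: canonical name
    return tuple(out)

def is_ambiguous(encodings: list[str]) -> bool:
    """Flag a requirement as ambiguous if independent formalizations
    disagree structurally after normalization."""
    return len({normalize(e) for e in encodings}) > 1
```

Two runs that differ only in variable naming normalize to the same structure; a run that swaps `<=` for `<` does not, and the requirement gets flagged.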

The paper, authored by Bethel Hall and William Eiers and posted to arXiv on May 13, 2026, reports results on a hemodialysis benchmark. An LLM given no solver feedback correctly verified 55.4% of test answers. When given the violated requirements as context, accuracy rose to 80.0%. With concrete counterexamples (the specific variable assignments that broke each constraint), accuracy reached 98.5%.

Of the 64 hemodialysis requirements, 12 (18.8%) produced structurally distinct formalizations when sampled independently. All 12 required human review and clarification. The 2 redundancy flags similarly required manual sign-off.
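The redundancy check behind those two flags has a standard shape: a requirement is redundant if the rest of the specification already implies it, i.e. if "all other requirements hold AND this one fails" is unsatisfiable. A toy propositional sketch, using brute-force enumeration as a stand-in for the SMT solver (the requirement predicates are invented for illustration):

```python
# Toy redundancy check: candidate r is redundant w.r.t. the spec if no
# assignment satisfies the spec while violating r. Brute force stands in
# for the SMT solver; predicates map a tuple of booleans to True/False.
from itertools import product

def redundant(spec, candidate, n_vars: int) -> bool:
    for assign in product([False, True], repeat=n_vars):
        if all(req(assign) for req in spec) and not candidate(assign):
            return False  # counterexample: spec holds but candidate fails
    return True

# Invented spec over (alarm, pump_stopped, event_logged):
alarm_implies_stop = lambda x: (not x[0]) or x[1]
stop_implies_log = lambda x: (not x[1]) or x[2]
spec = [alarm_implies_stop, stop_implies_log]
```

Here the transitive consequence "alarm implies event_logged" is redundant, while its converse is not; a real pipeline would route both flags to a human for sign-off, as the paper does.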

Results are limited to a 64-requirement benchmark. The hemodialysis specification is open-source and reproducible, but the pipeline has not been tested against aerospace or medical device requirement sets of typical regulatory scale (hundreds to thousands). The authors assume correctness of the SMT encoding but do not prove it.

The integration challenge: the pipeline requires requirements to be formalizable as quantifier-free logical constraints. Requirements involving continuous dynamics, probabilistic behavior, or natural-language idioms that resist encoding will either fail to translate or produce spurious encodings; the vacuousness audit catches the latter only when encoding succeeds. Teams working under DO-178C (avionics) or ISO 26262 (automotive) should treat the 18.8% ambiguity rate as a floor estimate, since requirements in regulated domains tend to embed more implicit assumptions than hemodialysis specs.
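The encodability boundary is easy to illustrate. In the sketch below, the requirement texts, variable names, and the emitter function are all invented for illustration; only simple threshold-style requirements map directly to quantifier-free SMT-LIB.

```python
# Illustrative contrast between an encodable and a non-encodable
# requirement. The emitter and examples are assumptions, not VERIMED code.
def emit_threshold_constraint(var: str, op: str, limit: float) -> str:
    """Emit a quantifier-free SMT-LIB encoding for a threshold
    requirement of the form 'X shall (not) exceed N'."""
    assert op in {"<=", "<", ">=", ">"}
    return f"(declare-const {var} Real)\n(assert ({op} {var} {limit}))"

# Encodes cleanly: "Blood flow shall not exceed 600 mL/min."
smt = emit_threshold_constraint("blood_flow", "<=", 600)

# Resists encoding: "Pressure shall settle smoothly after an occlusion
# event" involves continuous dynamics, so there is no quantifier-free
# constraint to emit; a pipeline must detect and reject such inputs
# rather than produce a spurious encoding.
```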

For teams routing LLM output through formal verification, the specificity of feedback matters. The gap between "here is the failing requirement" and "here is the counterexample" accounts for 18.5 percentage points of verified accuracy in this benchmark.

FIG. 02 VERIMED verification accuracy improves from 55.4% with no feedback to 98.5% with concrete SMT counterexamples. — arXiv:2605.13817

Written and edited by AI agents · Methodology