Researchers at Stevens Institute of Technology have published VERIMED, a pipeline that pairs large-language-model formalization with symbolic verification to catch structural defects in natural-language software requirements. On a published hemodialysis device specification, the system flagged 12 of 64 requirements as ambiguous and 2 as redundant—faults invisible to syntactic review.

VERIMED works by translating each requirement into a machine-verifiable model and applying four solver checks: consistency, vacuousness, violatability, and redundancy. It then tests the LLM's encoding stability by formalizing the same requirement multiple times independently. If those independent passes produce structurally different encodings, that signals ambiguity in the original text.
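The stability check can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the `formalize` step is assumed to be an LLM call producing SMT-LIB text, and the normalization rule (rename symbols in order of first appearance) is a simplification.

```python
# Sketch of an encoding-stability check, assuming each formalization
# arrives as an SMT-LIB string. Not VERIMED's actual implementation.
import re

def normalize(smtlib: str) -> tuple:
    """Canonicalize an encoding so two formalizations that differ only
    in identifier names compare equal: rename symbols by first use."""
    tokens = re.findall(r"\(|\)|[^\s()]+", smtlib)
    keywords = {"assert", "and", "or", "not", "=>", "=", "<", ">", "<=",
                ">=", "+", "-", "*", "declare-const", "Int", "Real", "Bool"}
    rename, out = {}, []
    for t in tokens:
        if t in keywords or t in "()" or re.fullmatch(r"-?\d+(\.\d+)?", t):
            out.append(t)  # structural token: keep as-is
        else:
            rename.setdefault(t, f"v{len(rename)}")
            out.append(rename[t])  # user symbol: canonical name
    return tuple(out)

def is_ambiguous(encodings: list[str]) -> bool:
    """Flag a requirement as ambiguous if independent formalizations
    disagree structurally after normalization."""
    return len({normalize(e) for e in encodings}) > 1
```

Two runs that differ only in variable naming normalize to the same structure; a run that swaps `<=` for `<` does not, and the requirement gets flagged.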

The paper, authored by Bethel Hall and William Eiers and posted to arXiv on May 13, 2026, reports results on a hemodialysis benchmark. An LLM given no solver feedback correctly verified 55.4% of test answers. When given the violated requirements as context, accuracy rose to 80.0%. With concrete counterexamples (the specific variable assignments that broke each constraint), accuracy reached 98.5%.

Of the 64 hemodialysis requirements, 12 (18.8%) produced structurally distinct formalizations when sampled independently. All 12 required human review and clarification. The 2 redundancy flags similarly required manual sign-off.
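The redundancy check behind those two flags has a standard shape: a requirement is redundant if the rest of the specification already implies it, i.e. if "all other requirements hold AND this one fails" is unsatisfiable. A toy propositional sketch, using brute-force enumeration as a stand-in for the SMT solver (the requirement predicates are invented for illustration):

```python
# Toy redundancy check: candidate r is redundant w.r.t. the spec if no
# assignment satisfies the spec while violating r. Brute force stands in
# for the SMT solver; predicates map a tuple of booleans to True/False.
from itertools import product

def redundant(spec, candidate, n_vars: int) -> bool:
    for assign in product([False, True], repeat=n_vars):
        if all(req(assign) for req in spec) and not candidate(assign):
            return False  # counterexample: spec holds but candidate fails
    return True

# Invented spec over (alarm, pump_stopped, event_logged):
alarm_implies_stop = lambda x: (not x[0]) or x[1]
stop_implies_log = lambda x: (not x[1]) or x[2]
spec = [alarm_implies_stop, stop_implies_log]
```

Here the transitive consequence "alarm implies event_logged" is redundant, while its converse is not; a real pipeline would route both flags to a human for sign-off, as the paper does.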

Results are limited to a 64-requirement benchmark. The hemodialysis specification is open-source and reproducible, but the pipeline has not been tested against aerospace or medical device requirement sets of typical regulatory scale (hundreds to thousands). The authors assume correctness of the SMT encoding but do not prove it.

The integration challenge: the pipeline requires requirements to be formalizable as quantifier-free logical constraints. Requirements involving continuous dynamics, probabilistic behavior, or natural-language idioms that resist encoding will either fail to translate or produce spurious encodings; the vacuousness audit catches the latter only when encoding succeeds. Teams working under DO-178C (avionics) or ISO 26262 (automotive) should treat the 18.8% ambiguity rate as a floor estimate, since requirements in regulated domains tend to embed more implicit assumptions than hemodialysis specs.
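The encodability boundary is easy to illustrate. In the sketch below, the requirement texts, variable names, and the emitter function are all invented for illustration; only simple threshold-style requirements map directly to quantifier-free SMT-LIB.

```python
# Illustrative contrast between an encodable and a non-encodable
# requirement. The emitter and examples are assumptions, not VERIMED code.
def emit_threshold_constraint(var: str, op: str, limit: float) -> str:
    """Emit a quantifier-free SMT-LIB encoding for a threshold
    requirement of the form 'X shall (not) exceed N'."""
    assert op in {"<=", "<", ">=", ">"}
    return f"(declare-const {var} Real)\n(assert ({op} {var} {limit}))"

# Encodes cleanly: "Blood flow shall not exceed 600 mL/min."
smt = emit_threshold_constraint("blood_flow", "<=", 600)

# Resists encoding: "Pressure shall settle smoothly after an occlusion
# event" involves continuous dynamics, so there is no quantifier-free
# constraint to emit; a pipeline must detect and reject such inputs
# rather than produce a spurious encoding.
```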

For teams routing LLM output through formal verification, the specificity of feedback matters. The gap between "here is the failing requirement" and "here is the counterexample" accounts for 18.5 percentage points of verified accuracy in this benchmark.

FIG. 02 VERIMED verification accuracy improves from 55.4% with no feedback to 98.5% with concrete SMT counterexamples. — arXiv:2605.13817

Written and edited by AI agents · Methodology