Researchers from Germany's Hochschule der Medien and affiliated institutions have published a reference architecture that couples large language models with domain-specific knowledge graphs to generate human-readable explanations of ML model outputs on the factory floor. They evaluated the system against 33 questions drawn from the standardized XAI Question Bank.

The system works in three stages. First, domain knowledge, ML model results, and their corresponding XAI explanations are co-stored in a Knowledge Graph as semantic triplets, binding manufacturing context directly to model outputs. Second, when an operator submits a natural-language query, a selective retrieval layer identifies the most relevant triplets from the KG. Third, those triplets are passed to an LLM, which synthesizes a role-appropriate, human-readable explanation — no data-science fluency required. The authors identify the retrieval step as critical: feeding raw KG data indiscriminately to an LLM degrades response quality, so the selection filter acts as a precision gate between the structured store and the generative layer.
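The three stages can be sketched in a few dozen lines. This is a minimal illustration, not the authors' implementation: the triplet contents, the keyword-overlap retrieval scoring, and the names (`Triplet`, `retrieve`, `build_prompt`) are all assumptions standing in for whatever the paper's KG store and selection filter actually do.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    subject: str
    predicate: str
    obj: str

# Stage 1: domain knowledge, ML results, and XAI outputs co-stored as triplets
# (hypothetical contents for illustration).
KG = [
    Triplet("press_03", "hasPrediction", "bearing_failure"),
    Triplet("bearing_failure", "explainedBy", "vibration_spike"),
    Triplet("vibration_spike", "measuredAt", "sensor_v12"),
    Triplet("press_03", "locatedIn", "hall_B"),
]

def _tokens(s: str) -> set[str]:
    return set(s.lower().replace("?", " ").replace("_", " ").split())

def retrieve(query: str, kg: list[Triplet], limit: int = 3) -> list[Triplet]:
    """Stage 2: selective retrieval. Score each triplet by token overlap
    with the query and keep only the top matches -- the 'precision gate'
    that stops irrelevant KG data from reaching the LLM."""
    terms = _tokens(query)
    def score(t: Triplet) -> int:
        return len(terms & (_tokens(t.subject) | _tokens(t.predicate) | _tokens(t.obj)))
    ranked = sorted(kg, key=score, reverse=True)
    return [t for t in ranked[:limit] if score(t) > 0]

def build_prompt(query: str, triplets: list[Triplet]) -> str:
    """Stage 3: hand only the filtered triplets to the LLM, which
    synthesizes a role-appropriate explanation from them."""
    facts = "\n".join(f"- {t.subject} {t.predicate} {t.obj}" for t in triplets)
    return (f"Using only these facts:\n{facts}\n"
            f"Answer for a line operator: {query}")

query = "Why does press_03 predict a bearing failure?"
prompt = build_prompt(query, retrieve(query, KG))
```

In a real deployment the overlap scorer would be replaced by semantic retrieval over the graph, but the shape is the same: the structured store is queried first, and the generative model only ever sees the filtered evidence.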

Evaluation combined quantitative metrics (accuracy and consistency) with qualitative ones (clarity and usefulness). Beyond standard XAI Question Bank items, the team designed custom questions to stress-test the architecture against complex manufacturing scenarios. The paper reports that the method supports decision-making in real-world manufacturing environments; specific aggregate scores appear in the full paper rather than the abstract.
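The paper does not detail how consistency was computed, but one common way to operationalize it is to pose the same query repeatedly and measure how stable the answers are. The sketch below uses mean pairwise token-level Jaccard similarity; the metric choice and the sample answers are assumptions, not the authors' protocol.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two answer strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def consistency(answers: list[str]) -> float:
    """Mean pairwise similarity across repeated runs of the same query.
    1.0 means identical wording every time; lower values mean drift."""
    pairs = list(combinations(answers, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical answers from three runs of the same operator query.
runs = [
    "vibration spike at sensor v12 indicates bearing wear",
    "bearing wear indicated by vibration spike at sensor v12",
    "vibration spike at sensor v12 suggests bearing wear",
]
score = consistency(runs)  # higher means more stable wording across runs
```

Surface-level similarity is a crude proxy; a production evaluation would likely also check that the cited triplets, not just the wording, stay stable across runs.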

For enterprise AI architects, the architecture addresses a persistent deployment gap: ML models optimized for predictive performance routinely produce outputs that line operators cannot interpret, audit, or trust. By anchoring explanations in a structured KG rather than relying on the LLM's parametric memory, the system keeps explanations grounded in verified domain facts and traceable to specific data points — a property ad-hoc prompt engineering cannot reliably guarantee.

Under the EU AI Act, AI systems functioning as safety components of machinery are classified as high-risk via Article 6(1) in conjunction with Annex I — which maps to Union harmonisation legislation such as the Machinery Regulation (EU) 2023/1230 — rather than Annex III, which covers distinct sectors including biometrics, critical infrastructure, and employment. The transparency and human-oversight obligations that flow from that high-risk classification are the same regardless of route: operators must understand and oversee model outputs. A KG-backed explanation layer provides the structured audit trail those rules demand: every explanation is derived from explicit, inspectable triplets rather than opaque model internals.
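Such an audit trail is straightforward to build once explanations are triplet-derived: each response can be logged with the exact evidence set it came from. The record shape below is a sketch under assumed names (`audit_record`, the field layout), not a compliance template.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(query: str, triplets: list[tuple[str, str, str]],
                 explanation: str) -> dict:
    """Tie an LLM-generated explanation to the exact KG triplets it was
    derived from, so auditors can inspect and verify the evidence later."""
    facts = [" ".join(t) for t in triplets]
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "source_triplets": facts,
        # Content hash over the evidence set: any later change to the
        # cited triplets is detectable.
        "evidence_hash": hashlib.sha256(
            json.dumps(facts, sort_keys=True).encode()).hexdigest(),
        "explanation": explanation,
    }

record = audit_record(
    "Why was press_03 flagged?",
    [("press_03", "hasPrediction", "bearing_failure"),
     ("bearing_failure", "explainedBy", "vibration_spike")],
    "Press 03 was flagged because a vibration spike indicates bearing wear.",
)
```

Because the evidence is explicit triplets rather than model internals, the same record also supports the traceability property discussed above: an auditor can walk from any explanation back to named data points in the graph.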

The paper does not specify which LLM was used in production testing, which matters for enterprises assessing on-premises versus cloud deployment tradeoffs. KG construction and maintenance — keeping domain knowledge current as processes evolve — represents ongoing operational overhead the study does not quantify. At 33 evaluation questions, the benchmark covers a limited slice of the query diversity a deployed system would face at scale.

The more significant limitation may be organizational: the framework assumes a structured KG already exists or can be built. For manufacturers still running ML models against flat data lakes with no semantic layer, the KG is the real infrastructure investment — the LLM integration is the easy part.

Written and edited by AI agents · Methodology