
Why LLMs Alone Are Not Enough for Retail AI

Primoz Zajsek

There is a gap in retail AI that almost nobody is talking about. Not the gap between hype and reality, which gets plenty of coverage, but the gap between a retail AI that looks impressive and one that can actually be held accountable for its recommendations. These are not the same thing, and the difference matters more than most retailers evaluating AI solutions realise.

The probability problem

Large language models are probability machines. That is not a criticism; it is a description of how they work. An LLM generates responses by predicting the most likely next token given everything that came before it. It has been trained on vast amounts of text and has learned patterns: which words follow which, which concepts relate to which, which products tend to be associated with which queries. This makes LLMs extraordinarily capable at language tasks. They understand context, handle ambiguity, and generate fluent, natural responses, and for most applications probabilistic generation is exactly what you want.

Retail product recommendations are a different kind of problem. When a shopper says they need a cordless drill for assembling flat-pack furniture with a budget of €150, there is a correct answer. It is not the most probable product associated with the phrase "cordless drill for furniture assembly." It is the product that matches the actual requirements, covering torque range, battery life, chuck size, and price, verified against what is actually in stock right now. An LLM does not verify. It predicts. It surfaces what it has learned is statistically associated with the query, and while the result often looks plausible, the system has no mechanism to guarantee correctness because it is not built around correctness. It is built around probability. In retail, plausible is not good enough. A confident recommendation for the wrong product does not just fail to convert. It damages trust in a way that is hard to recover from.

What happens when you layer AI on top of bad data

The probability problem is compounded by a second one that most retailers do not anticipate until they are already building. Most retailer catalogs are not structured for precise AI reasoning. They have been built over years by ingesting data from multiple suppliers, each with their own attribute names, formats, and completeness levels. One supplier calls it "battery capacity," another calls it "mAh," and a third buries it in the product description with no structured field at all. The result is a catalog that looks comprehensive but is full of gaps, inconsistencies, and attributes that mean different things in different contexts.
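To make the supplier mismatch concrete, here is a minimal sketch of attribute normalization. The alias table, the canonical field name `battery_capacity_mah`, and the sample records are all assumptions made for illustration, not a description of any real catalog schema. The point it shows is that a value that cannot be found stays empty instead of being guessed.

```python
import re
from typing import Optional

# Hypothetical alias table: supplier field names mapped to one canonical name.
ALIASES = {
    "battery capacity": "battery_capacity_mah",
    "mah": "battery_capacity_mah",
}

# Matches values such as "2.0 Ah" or "1500mAh".
CAPACITY = re.compile(r"(\d+(?:\.\d+)?)\s*(mAh|Ah)\b", re.IGNORECASE)


def parse_capacity_mah(text: str, require_unit: bool) -> Optional[int]:
    """Return a capacity in mAh, or None when the text does not state one."""
    match = CAPACITY.search(text)
    if match:
        value, unit = float(match.group(1)), match.group(2).lower()
        return round(value * 1000) if unit == "ah" else round(value)
    if not require_unit:
        # A bare number in a structured field is assumed to be mAh.
        try:
            return round(float(text))
        except ValueError:
            return None
    return None


def normalize(record: dict) -> dict:
    """Map one supplier record onto the canonical attribute, never guessing."""
    capacity = None
    for key, raw in record.items():
        if ALIASES.get(key.strip().lower()) == "battery_capacity_mah":
            capacity = parse_capacity_mah(str(raw), require_unit=False)
    if capacity is None:
        # Fall back to the free-text description, where a unit is mandatory.
        capacity = parse_capacity_mah(str(record.get("description", "")), require_unit=True)
    return {"battery_capacity_mah": capacity}


supplier_a = normalize({"battery capacity": "2.0 Ah"})
supplier_b = normalize({"mAh": 2000})
supplier_c = normalize({"description": "Compact drill with 1500mAh battery"})
supplier_d = normalize({"description": "Compact drill"})
```

The first three records end up with the same canonical field, while the fourth keeps an explicit empty value that a later enrichment step can flag for review.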

When an LLM-based shopping assistant encounters missing or inconsistent attributes, it fills the gaps the same way it fills any other gap, by making an educated guess based on patterns in its training data. The guess is often reasonable and rarely verifiable, and when it is wrong there is no audit trail that explains why. This is why so many retail AI deployments underperform in production relative to the demo: the demo used clean data and production uses real data. A reliable AI shopping experience cannot be built on top of an unstructured catalog. The data is the foundation, and if it is not right, nothing built on top of it is reliable.

What enterprise-grade actually means

Enterprise-grade is a phrase that gets used freely in AI marketing and rarely defined. In the context of retail AI it has a specific meaning: the system can be held accountable for every recommendation it makes.

Accountability means three things in practice. First, every recommendation is grounded in real catalog data: not a semantic approximation or a description indexed last week, but the actual structured product data, including attributes, inventory status, and business rules, queried at the moment the shopper asks. If the system says a product matches the shopper's requirements, it matches them, verified rather than inferred. Second, every recommendation is explainable: the shopper can see why a product was recommended, the retailer can audit the logic, and if a recommendation is wrong you can trace exactly where it broke down. This is not just good UX. It is also where regulation is heading: the EU AI Act introduces transparency obligations for AI systems, and its requirements are being phased in over several years. Third, the system handles uncertainty correctly. When a requirement is ambiguous it asks a clarifying question before filtering. When no products match all requirements it says so honestly rather than surfacing an approximate result. When a product is out of stock it does not appear.
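The three behaviours can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the `Decision` type, the `decide` function, the status names, and the three-product catalog are all hypothetical. What it shows is a result that carries its own audit trail, asks for clarification when the budget is missing, and reports an empty result honestly.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    status: str  # "clarify", "no_match", or "matches"
    skus: list = field(default_factory=list)
    explanation: list = field(default_factory=list)


# Hypothetical catalog rows, invented for this example.
CATALOG = [
    {"sku": "D-100", "price_eur": 129.0, "torque_nm": 30.0, "in_stock": 4},
    {"sku": "D-200", "price_eur": 189.0, "torque_nm": 55.0, "in_stock": 7},
    {"sku": "D-400", "price_eur": 139.0, "torque_nm": 35.0, "in_stock": 0},
]


def decide(max_price_eur, min_torque_nm, catalog=CATALOG):
    # Ambiguous requirement: ask before filtering instead of assuming a budget.
    if max_price_eur is None:
        return Decision("clarify", explanation=["No budget given: ask before filtering."])
    filters = [
        ("in stock", lambda p: p["in_stock"] > 0),
        (f"price <= {max_price_eur} EUR", lambda p: p["price_eur"] <= max_price_eur),
        (f"torque >= {min_torque_nm} Nm", lambda p: p["torque_nm"] >= min_torque_nm),
    ]
    remaining, trail = list(catalog), []
    for label, keep in filters:
        remaining = [p for p in remaining if keep(p)]
        # Record how many products survive each filter: this is the audit trail.
        trail.append(f"{label}: {len(remaining)} left")
    status = "matches" if remaining else "no_match"
    return Decision(status, [p["sku"] for p in remaining], trail)


found = decide(150.0, 25.0)
nothing = decide(150.0, 50.0)
unclear = decide(None, 25.0)
```

Because each filter records how many products survive it, a wrong or empty result can be traced to the exact requirement that caused it.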

None of this is possible if the AI is operating through probabilistic generation on top of unstructured data. It requires a different architecture, one that separates the language understanding layer from the catalog reasoning layer and applies exact matching logic at the point where recommendations are generated.

The decision layer architecture

The resolution is to stop using the AI as the recommendation engine and start using it as the translation layer. The AI's job is to understand the shopper, to take a vague natural language intent and translate it into a structured set of requirements covering which attributes matter, what the constraints are, and what the shopper is actually trying to achieve. This is where LLMs are genuinely excellent, and using them for this job plays to their strengths.
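As a rough sketch of what "a structured set of requirements" could look like, the example below validates the output of a language layer before anything reaches the catalog. The `Requirements` fields, the JSON shape, and the `parse_requirements` helper are assumptions chosen for the cordless drill example. The call to the language model itself is deliberately left out.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Requirements:
    """Hypothetical structured form of a shopper's intent."""
    category: str
    max_price_eur: Optional[float] = None
    min_torque_nm: Optional[float] = None
    cordless: Optional[bool] = None


ALLOWED_FIELDS = {"category", "max_price_eur", "min_torque_nm", "cordless"}


def parse_requirements(llm_output: str) -> Requirements:
    """Validate the language layer's JSON; reject anything outside the schema."""
    data = json.loads(llm_output)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    unknown = set(data) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown fields from language layer: {sorted(unknown)}")
    if "category" not in data:
        raise ValueError("category is required")
    return Requirements(**data)


# Example payload a language layer might emit for the drill query (assumed shape).
example = '{"category": "drill", "cordless": true, "max_price_eur": 150}'
reqs = parse_requirements(example)
```

Rejecting unknown fields keeps the language layer from introducing attributes the catalog does not have, and any requirement the shopper did not state stays unset.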

The recommendation job belongs to a different layer, one that takes those structured requirements and runs them as precise queries against the actual catalog database. Exact attribute filters, real match counts, inventory verified, business rules applied. The output is not a ranked list of probably-relevant products. It is a set of products that satisfy every stated requirement, verified against the live catalog at the moment of the recommendation. The language layer handles conversation and the decision layer handles matching. The result is a system where recommendations can be verified, audited, and trusted consistently.
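A minimal sketch of the decision layer, using an in-memory SQLite table as a stand-in for the live catalog. The table layout, the SKUs, and the `match_products` function are invented for this example. A production system would query its own catalog store and apply its own business rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE products (
           sku TEXT PRIMARY KEY, category TEXT, cordless INTEGER,
           torque_nm REAL, price_eur REAL, in_stock INTEGER)"""
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("D-100", "drill", 1, 30.0, 129.0, 4),
        ("D-200", "drill", 1, 55.0, 189.0, 7),  # over budget
        ("D-300", "drill", 0, 40.0, 99.0, 2),   # corded
        ("D-400", "drill", 1, 35.0, 139.0, 0),  # out of stock
    ],
)


def match_products(conn, category, cordless, max_price_eur):
    """Return SKUs that satisfy every requirement, cheapest first."""
    rows = conn.execute(
        """SELECT sku FROM products
           WHERE category = ? AND cordless = ? AND price_eur <= ? AND in_stock > 0
           ORDER BY price_eur""",
        (category, int(cordless), max_price_eur),
    ).fetchall()
    return [sku for (sku,) in rows]


matches = match_products(conn, "drill", True, 150.0)
no_matches = match_products(conn, "drill", True, 50.0)
```

With a €150 budget only D-100 passes every filter. With a €50 budget the list comes back empty instead of falling back to a near miss.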

Why this matters beyond conversion

There is a near-term business case for this architecture: better recommendations convert better, and a shopper who receives an accurate recommendation grounded in their specific requirements is more likely to buy than one who receives a plausible guess. But there is also a longer-term case worth understanding.

The EU AI Act is creating legal obligations around AI transparency and explainability that are being phased in over several years, with some provisions of the Act already applying. Retailers deploying AI shopping tools will increasingly need to demonstrate that their systems can explain their recommendations, that they do not hallucinate product attributes, and that the logic is auditable. Probabilistic recommendation systems are not built for this. A system that generates recommendations through pattern matching can tell you what it recommended but cannot tell you why in terms that are verifiable against actual product data. Retailers building on accountable AI architecture now are not just making a better product decision. They are building toward compliance with a regulatory environment that is moving in one direction.

The catalog is not optional

Everything described above depends entirely on catalog data that is structured well enough to query precisely. Most retail catalogs are not there yet, and getting them there is not a side project. It is the prerequisite. Without structured product data there is no foundation for precise attribute queries, and without precise attribute queries there is no accountable recommendation. The AI will fill the gaps with probability and the problems described above will follow. Catalog enrichment is not a separate problem from AI shopping. It is the same problem tackled in sequence: structure the data first, then build the AI layer on top of it.

Gem is built on this architecture. The Gem Platform structures and enriches your catalog data. The Gem Engine runs exact attribute queries against it. The Gem Shopping Expert surfaces the result to shoppers with no hallucinations, no guesswork, and full auditability. Book a demo to see how it works on your catalog.