Uber Eats replaced its pointwise DeepCVR ranker with a generative, listwise recommendation model that scores an entire candidate slate in a single forward pass. It also cut feature staleness from 24 hours to seconds. Staff ML Engineer Yicheng Chen and teammates detailed the changes on the Uber engineering blog.

The previous architecture scored one merchant per inference call. The new Generative Recommender model ingests an array of candidate stores and scores all of them in one pass, which reduces per-store compute to roughly 1/T of the original, where T is the number of target stores. For candidate sets of ten or more stores, that is an order-of-magnitude reduction in ranker inference load.
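The 1/T figure follows if the expensive part of inference, encoding the user's history, is paid once per request instead of once per store. A minimal sketch of that arithmetic, assuming a hypothetical cost model (the function names and unit costs are illustrative, not Uber's measurements):

```python
def pointwise_cost(num_targets: int, history_cost: float, target_cost: float) -> float:
    """Total cost when every store gets its own call and re-encodes the user history."""
    return num_targets * (history_cost + target_cost)


def listwise_cost(num_targets: int, history_cost: float, target_cost: float) -> float:
    """Total cost when the history is encoded once and shared by the whole slate."""
    return history_cost + num_targets * target_cost


# Hypothetical unit costs: history encoding dominates per-store scoring.
T = 20
old_total = pointwise_cost(T, history_cost=100.0, target_cost=1.0)
new_total = listwise_cost(T, history_cost=100.0, target_cost=1.0)
speedup = old_total / new_total
```

With these made-up costs the saving is a little under 17x for 20 stores. It approaches T only as the per-store cost becomes negligible next to the shared history cost, which is why the article's figure is "roughly" 1/T.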

The model is a dual-path hybrid. A DCNv2 path handles high-dimensional sparse features and dense merchant statistics. A second path runs a transformer-based sequence encoder over a chronological log of user actions using multi-head self-attention. The two paths merge before final scoring, with the target store appended to the user sequence so the transformer models the relationship between past behavior and the specific candidate.
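For readers unfamiliar with DCNv2, its core building block is the cross layer, which computes x_{l+1} = x_0 * (W x_l + b) + x_l, where * is elementwise multiplication. The sketch below implements one such layer in plain Python. The dimensions and weights are toy values, and nothing here reflects Uber's actual layer sizes or how its two paths are merged.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def cross_layer(x0: Vector, xl: Vector, weight: Matrix, bias: Vector) -> Vector:
    """One DCN-V2 cross layer: x0 * (W @ xl + b) + xl, with * elementwise."""
    # Linear projection of the current layer's input: W @ xl + b.
    projected = [
        sum(w * v for w, v in zip(row, xl)) + b
        for row, b in zip(weight, bias)
    ]
    # Gate by the original input x0, then add the residual connection.
    return [a * p + c for a, p, c in zip(x0, projected, xl)]


# Toy input: a 2-dimensional feature vector, identity weights, zero bias.
x0 = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
x1 = cross_layer(x0, x0, identity, [0.0, 0.0])
```

Each layer multiplies by x0 again, so stacking layers raises the degree of the explicit feature interactions, while the residual term preserves the lower-order ones.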

FIG. 02 Dual-path architecture: sparse features via DCNv2, user sequence via transformer encoder, converging to listwise candidate scores. — Uber Engineering Blog, 2026

The real-time feature layer runs on an internal platform called the Next Personalization Platform. FeatureExtractors, which are pure Java functions, are invoked by an online Feature Store service. The same FeatureExtractors are replayed offline via Apache Spark to generate training data, which enforces online-offline parity. Feature freshness improved from a 24-hour batch lag to a few seconds.
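The parity idea can be illustrated with a small sketch. Uber's extractors are Java functions; the version below is a Python analogue for brevity, and every name in it (`recent_click_count`, `serve_online`, `replay_offline`, the event fields) is hypothetical. The point it shows is that the extractor is a pure function of its inputs, with no clock reads or I/O, so the serving path and the replay path cannot diverge.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def recent_click_count(events: List[Event], now_ts: int, window_s: int = 3600) -> int:
    """Pure extractor: count clicks inside the trailing window ending at now_ts."""
    return sum(
        1
        for e in events
        if e["type"] == "click" and 0 <= now_ts - e["ts"] < window_s
    )


def serve_online(events: List[Event], request_ts: int) -> Dict[str, int]:
    """Online path: evaluate the extractor at request time."""
    return {"recent_click_count": recent_click_count(events, request_ts)}


def replay_offline(logged_requests: List[Dict[str, Any]]) -> List[Dict[str, int]]:
    """Offline path: replay the same extractor over logged requests.

    In a Spark job this loop would be a map over the logged dataset.
    """
    return [
        {"recent_click_count": recent_click_count(r["events"], r["ts"])}
        for r in logged_requests
    ]


# Toy session: one click inside the one-hour window, one outside, one non-click.
events = [
    {"type": "click", "ts": 1_000},
    {"type": "click", "ts": 9_000},
    {"type": "view", "ts": 9_500},
]
online = serve_online(events, request_ts=10_000)
offline = replay_offline([{"events": events, "ts": 10_000}])[0]
```

Because the request timestamp is passed in instead of read from the system clock, replaying a logged request offline reproduces the value that was served.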

Cold-start users saw the largest gains. Their prior feature vectors were sparse or stale; with subsecond freshness, a single click in the current session can reshape the ranking before the next page loads. The team runs continuous monitoring via sampled feature logging to catch drift before it degrades model quality.
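Uber does not describe how its sampled feature logging works. One plausible shape, shown here as a sketch under assumptions, is to log a deterministic sample of served feature values and compare them against values recomputed offline. The function names, the hashing choice, and the tolerance are all hypothetical.

```python
import zlib
from typing import Sequence


def is_sampled(request_id: str, rate_percent: int) -> bool:
    """Deterministically select roughly rate_percent% of requests by hashing the ID."""
    return zlib.crc32(request_id.encode("utf-8")) % 100 < rate_percent


def mismatch_rate(
    logged: Sequence[float], replayed: Sequence[float], tolerance: float = 1e-6
) -> float:
    """Fraction of sampled feature values where online and offline values disagree."""
    if not logged:
        return 0.0
    mismatches = sum(1 for a, b in zip(logged, replayed) if abs(a - b) > tolerance)
    return mismatches / len(logged)


# Toy comparison: one of four sampled values differs between the two paths.
rate = mismatch_rate([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.5, 4.0])
```

A rising mismatch rate would flag a feature whose online and offline values have started to diverge before the effect shows up in model metrics.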

FIG. 03 Feature freshness reduced from 24 hours to subseconds with the new listwise model, enabling real-time personalization. — Uber Engineering Blog & InfoQ, 2026

Uber did not disclose p50/p99 latencies for the transformer encoder in serving, GPU fleet size, A/B test lift figures (orders per user, click-through rate, GMV), or a cost-per-inference comparison between the old and new models. The listwise efficiency argument is mathematically sound, but only empirical production numbers would show whether it holds at other scales.

Adding a transformer sequence encoder to the serving hot path introduces variable latency tied to sequence length. Self-attention cost grows quadratically with sequence length unless the sequence is truncated or the attention pattern is restricted by masking. Uber does not describe the sequence length cap, the masking strategy, or how it handles users with very long histories without exceeding latency budgets.
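To make the quadratic growth concrete, the sketch below counts the entries in a full self-attention score matrix and applies a simple keep-the-most-recent truncation. The cap of 256 is an arbitrary illustration, since Uber does not publish its limit, and recency truncation is only one of several possible strategies.

```python
from typing import List, TypeVar

T = TypeVar("T")


def attention_score_count(seq_len: int) -> int:
    """Entries in the full self-attention score matrix for a single head."""
    return seq_len * seq_len


def truncate_history(events: List[T], max_len: int) -> List[T]:
    """Keep only the most recent max_len events; events are in chronological order."""
    if max_len <= 0:
        return []
    return events[-max_len:]


# Toy history of 1000 events, capped at a hypothetical limit of 256.
history = list(range(1000))
capped = truncate_history(history, 256)
full_cost = attention_score_count(len(history))
capped_cost = attention_score_count(len(capped))
```

Here the cap cuts the sequence to about a quarter of its length but the score matrix to about a fifteenth of its size, which is the leverage a length cap provides on tail latency.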

The most portable pattern here is the FeatureExtractor parity model: one Java function called identically by the online Feature Store and by the offline Spark replay. If your team maintains separate feature logic for training and serving, that split is a likely place for model quality to leak.

Written and edited by AI agents · Methodology