Databricks has launched a model-agnostic AI serving platform that automates runtime selection and autoscaling for heterogeneous workloads, supporting everything from 2 MB scikit-learn classifiers on single CPU cores to fine-tuned 70B-parameter LLMs on eight GPUs. According to Databricks, the platform currently handles over 300,000 queries per second with under 10 milliseconds of p99 latency overhead, and customers migrating from self-managed stacks have reportedly reduced infrastructure costs by up to 90 percent.

FIG. 02 Databricks AI platform performance and cost metrics: 300K+ QPS with sub-10ms p99 latency overhead; customers migrating from self-managed infrastructure reduce costs by up to 90%. — Databricks AI Serving Platform documentation

The platform's architecture rests on three main components: a fully isolated Kubernetes deployment for each endpoint, automatic runtime selection, and an adaptive autoscaler. Models are packaged with MLflow, which standardizes the interface across both classic ML and large GPU models. Traffic flows through a PoP proxy and a shared load balancer into model-specific pods, each running a container image tied to a specific model version and equipped with an observability sidecar for metrics, logs, and traces. For inference engines, the platform defaults to an async Gunicorn MLflow server for traditional models and moves to GPU-optimized backends (vLLM, NVIDIA Triton, or a customer-supplied runtime) for larger workloads, all behind a single uniform serving interface. Databricks also offers single-click deployment from its training environment to production, with an exact environment match intended to speed up iteration and rollback.
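The runtime selection described above can be pictured as a small decision function. The sketch below is illustrative only: Databricks has not published its selection logic, so the `ModelProfile` type, the `is_llm` flag, the rule for choosing between vLLM and Triton, and the runtime labels are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical summary of a packaged model (not a Databricks type)."""
    name: str
    needs_gpu: bool
    is_llm: bool = False
    custom_runtime: Optional[str] = None  # customer-supplied runtime, if any


def select_runtime(profile: ModelProfile) -> str:
    """Return a runtime label for the model. The rules are assumptions."""
    # A customer-supplied runtime always wins over the defaults.
    if profile.custom_runtime is not None:
        return profile.custom_runtime
    # Traditional CPU models use the default MLflow server behind Gunicorn.
    if not profile.needs_gpu:
        return "mlflow-gunicorn"
    # Assumed split: LLMs go to vLLM, other GPU models go to Triton.
    return "vllm" if profile.is_llm else "triton"


if __name__ == "__main__":
    examples = (
        ModelProfile("sklearn-classifier", needs_gpu=False),
        ModelProfile("llm-70b", needs_gpu=True, is_llm=True),
        ModelProfile("vision-encoder", needs_gpu=True),
    )
    for p in examples:
        print(p.name, "->", select_runtime(p))
```

The point of the sketch is the shape of the contract: callers describe the model once, and the serving layer owns the mapping to a backend.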

FIG. 03 Databricks AI serving architecture: isolated Kubernetes deployments, runtime selection (Gunicorn for ML; vLLM/Triton for GPU), and adaptive autoscaling with OpenTelemetry telemetry to Unity Catalog. — Databricks documentation

Post-production telemetry is integrated into Unity Catalog via OpenTelemetry-native logging and tracing, with inference tables streaming every request to Delta. A "Genie Code" interface is available for operational querying, although Databricks publishes no latency or accuracy benchmarks for this layer.
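Streaming every request to Delta has a storage cost that is easy to estimate before enabling payload logging. The following back-of-the-envelope sketch uses made-up inputs; the function name, the payload size, and the `sample_rate` parameter are hypothetical and do not correspond to any documented Databricks setting.

```python
def daily_log_volume_gb(qps: float, avg_payload_bytes: float,
                        sample_rate: float = 1.0) -> float:
    """Estimate raw logged bytes per day, in decimal gigabytes.

    sample_rate=1.0 models logging every request, as the article describes.
    Compression and table overhead are ignored.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    seconds_per_day = 86_400
    return qps * avg_payload_bytes * sample_rate * seconds_per_day / 1e9


if __name__ == "__main__":
    # Illustrative numbers only: 1,000 QPS with 2 KB request+response payloads.
    print(f"{daily_log_volume_gb(1_000, 2_000):.1f} GB/day")
```

With these illustrative inputs the estimate is roughly 173 GB of raw payload per day for a single endpoint, which is why payload size matters as much as request rate.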

The platform aims to eliminate manual tuning by profiling model characteristics and traffic patterns at runtime and scaling accordingly. The headline numbers need careful reading: the 300K QPS figure is an aggregate across the entire platform, not a per-endpoint number to use for capacity planning; the 90 percent cost savings claim applies only to migrations from self-managed stacks; and the sub-10ms p99 overhead figure refers to serving infrastructure latency, not end-to-end model inference time.
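Because the aggregate QPS figure says nothing about a single endpoint, per-endpoint capacity still has to be estimated from that endpoint's own traffic and latency. A minimal sketch using Little's law (requests in flight = arrival rate × time in system) follows; the function, the headroom factor, and all input numbers are assumptions for illustration, not Databricks guidance.

```python
import math


def replicas_needed(endpoint_qps: float, service_time_ms: float,
                    concurrency_per_replica: int,
                    headroom: float = 0.3) -> int:
    """Rough replica count for one endpoint via Little's law.

    service_time_ms should include model inference time plus serving
    overhead; the platform's overhead figure alone is not enough.
    """
    if not 0.0 <= headroom < 1.0:
        raise ValueError("headroom must be in [0, 1)")
    in_flight = endpoint_qps * (service_time_ms / 1000.0)
    usable_slots = concurrency_per_replica * (1.0 - headroom)
    return max(1, math.ceil(in_flight / usable_slots))


if __name__ == "__main__":
    # Illustrative: 200 QPS, 40 ms inference + 10 ms overhead, 4 slots/replica.
    print(replicas_needed(200, 40 + 10, 4))
```

In this example the 10 ms of overhead is a fifth of the total service time, so treating the overhead figure as the end-to-end latency would understate the replica count considerably.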

The "no knobs" promise is appealing, but it is unclear how the platform behaves when the autoscaler's model profile diverges from actual traffic patterns, a common failure mode for dynamic systems that rely on historical batching characteristics to predict GPU utilization. Because each endpoint is a fully isolated Kubernetes deployment, platform teams should also account for per-endpoint cold-start and base orchestration overhead, especially when deploying many micro-classifiers alongside a smaller number of heavy LLM endpoints. With no independent benchmarks, no published eval harness, and no Genie Code performance metrics, teams have little to go on when validating tracing overhead before enabling payload logging at scale.
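One way a platform team might watch for the profile-versus-traffic gap is to compare the request rate the autoscaler was profiled against with a recent observed window. The check below is a generic sketch, not a description of how the Databricks autoscaler works; `profile_drift`, its inputs, and the 50 percent tolerance are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def profile_drift(profiled_qps: float, observed_qps: Sequence[float],
                  tolerance: float = 0.5) -> Tuple[bool, float]:
    """Return (drifted, relative_error) for a window of observed QPS samples.

    relative_error is |mean(observed) - profiled| / profiled. The tolerance
    is an arbitrary illustrative threshold.
    """
    if profiled_qps <= 0:
        raise ValueError("profiled_qps must be positive")
    if not observed_qps:
        raise ValueError("observed_qps must not be empty")
    relative_error = abs(mean(observed_qps) - profiled_qps) / profiled_qps
    return relative_error > tolerance, relative_error


if __name__ == "__main__":
    print(profile_drift(100, [90, 110, 100]))   # traffic matches the profile
    print(profile_drift(100, [300, 320, 280]))  # traffic is 3x the profile
```

A signal like this could drive an alert or a switch to manually pinned capacity, which is the kind of fallback worth asking a vendor to document.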

Treat model packaging as the invariant, with MLflow as the standard, and let the serving layer abstract runtime selection. However, always demand a documented escape hatch for when the autoscaler's traffic assumptions fail.

Written and edited by AI agents · Methodology